
Detecting and tracking outdoor gym geometry for AR display of exercise suggestions

Aalto University

Supervisor: Perttu Hämäläinen
Advisor: Tuure Saloheimo

Aina Maki

May 2021

Abstract

Exercising the body is a fundamental part of health. Many resources are freely available, such as audio-visual content on the internet, nature itself, or outdoor gym structures in public parks. Building on these resources, it has been proposed to implement a mobile application that encourages physical activity, using augmented reality to engage the user and provide a different, motivating experience that invites them to explore and try new forms of exercise. This thesis starts with the implementation of a mobile application that detects and tracks outdoor gym structures, training neural networks to segment the structures and estimate their pose, and projecting exercise suggestions using augmented reality. However, during the project we discovered that this objective was not feasible, because the platform used to develop the project, Unity, is still evolving with respect to deep learning technologies.

Under these circumstances, this thesis documents the steps that have been taken, analyses why the original goal was infeasible, and proposes a study to improve the pose estimator.


Acknowledgements

I would like to give special thanks to my supervisor Perttu Hämäläinen for offering such an interesting thesis topic and for his guidance throughout the whole project. I am also grateful for his effort in building a nice working environment, holding weekly individual and group meetings and sharing interesting facts and work in progress, which motivated me and pushed me to work harder on my own topic while getting to know other projects.

I would also like to thank Tuure Saloheimo, who acted as my advisor during the first months. Thank you for your assistance and for always being available.

Finally, very warm thanks for the support of my friends, especially the friendships built here in Finland. Tobbia Katryn, Constantine, Charles, Nina, Bruno, Nico and Jordi, your energy kept me motivated and made me enjoy this whole experience on another level.

Without the assistance of the aforementioned people, this thesis would not have been possible.


Contents

1 Introduction
    1.1 Motivation
    1.2 Research approach
    1.3 Structure of the thesis

2 Background and related work
    2.1 6D Pose estimation
        2.1.1 Traditional methods
        2.1.2 Learning-Based approaches
    2.2 Deploying Deep Learning in Real-World Applications
        2.2.1 Unreal
        2.2.2 Unity
    2.3 Augmented reality

3 Methodology
    3.1 Starting points: Detecting and tracking playground structures
    3.2 Lappset-Base
        3.2.1 Generator
        3.2.2 Segmentator
        3.2.3 Pose estimator
        3.2.4 Triton cluster
        3.2.5 Lappset-Base execution
    3.3 Feasibility study: Unity Barracuda and Mobile Augmented Reality
        3.3.1 Unity Barracuda with Lappset-Base
        3.3.2 Augmented Reality in Unity
    3.4 Pose estimator analysis
        3.4.1 Adaptation process

4 Results
    4.1 Lappset-Base pose estimator
    4.2 Pose estimator adaptation

5 Conclusions

Chapter 1

Introduction

1.1 Motivation

Exercising and maintaining a healthy lifestyle is fundamental to a wholesome life. However, several barriers might prevent people from achieving this goal: economic barriers, meaning not being able to afford a gym subscription or to join sports clubs, lack of interest, and lack of knowledge about the different ways of exercising the body, among others.

With the recent quarantine due to the pandemic, the amount of resources for exercising at home has increased. There is a lot of media content on the internet with different exercise routines, which has been really helpful for balancing the physical inactivity caused by confinement. However, it is important to perform the exercises correctly and with good posture; otherwise exercising can be counterproductive, leading to back injuries, cartilage wear, etc.

The aim of this project is to offer a convenient, easy and visual aid for exercising in outdoor gyms, implemented as a mobile Augmented Reality application that tracks outdoor gym structures in the camera view and displays animated exercise suggestions.

The form of a mobile application is appropriate, since the vast majority of people use this tool on a daily basis and carry it everywhere. The detection and tracking in real time offers a straightforward experience, and the augmented reality projection provides a detailed exhibition of the exercise, allowing the user to walk 360 degrees around it and see all the parts of interest.


1.2 Research approach

The starting point of the thesis was an initial deep learning-based outdoor gym structure pose estimation codebase developed by the thesis advisor. This was to be integrated with the Unity game engine and a functional mobile Augmented Reality application, and then improved if there was still time remaining. The author first conducted a feasibility study of integrating simple deep learning models in Unity and mobile AR development. Then, integrating the full model was attempted. However, this ultimately proved infeasible, because the chosen software platform lacked support for some deep learning operators needed by the model. After this, the remaining time was used for investigating and testing better pose estimation models in a desktop software environment. A literature review of suitable pose estimation methods was also conducted. The original goal was not reached, but the experiments and the literature review should still be useful for others continuing this work.

1.3 Structure of the thesis

The rest of the thesis is organized as follows. Chapter 2 introduces the concepts necessary to understand the overall framework, as well as some projects that address similar issues. Chapter 3, Methodology, describes in detail the procedure for building the desired application, with its corresponding parts: Lappset-Base and the feasibility study. It describes the problem we encountered and why it was decided to approach the project differently; therefore, it also contains a study of the pose estimator, in order to make comparisons and try to achieve a better implementation. Chapter 4 presents the results obtained from both Lappset-Base and the comparison between pose estimators. Finally, chapter 5 presents the conclusions we have reached after carrying out the project.


Chapter 2

Background and related work

2.1 6D Pose estimation

In this section we introduce the concept of the 6 DoF (degrees of freedom) pose of non-deformable objects, such as the playground structures we use as our data. We detail the most common existing methods, their performance, and the challenges we may encounter.

The 6 DoF pose refers to the location and orientation of an object. It is defined by means of the rotation and translation transformations between the reference pose and the observed object's pose.
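Concretely, if x_obj denotes a point expressed in the object's reference frame, its coordinates in the camera frame under a pose given by a rotation matrix R and a translation vector t are

x_cam = R x_obj + t,

so estimating the 6D pose amounts to recovering the three rotation parameters and the three translation parameters of this transformation.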

This information is useful and applied in many fields, such as autonomous driving, SLAM (Simultaneous Localization and Mapping), augmented reality and many more.

In the case of augmented reality, which is the one we are interested in for this project, pose estimation is crucial, since the location and orientation must be known in order to add the virtual scene in the correct position.

The 6D pose can be retrieved from texture information, geometric information, color information and, recently, also depth information.

There are several methods for pose estimation. We will discuss them, moving from traditional computer vision methods to learning-based methods, which use deep neural networks and have emerged as the dominant paradigm over the past few years.


2.1.1 Traditional methods

The essence of these methods is to calculate key points or key features and match the input image with the most similar image in the dataset according to those key points or key features. We can differentiate between 2D information-based and 3D information-based approaches.

2D information-based approach

2D information-based approaches are more convenient, since the information can be obtained more easily by simpler devices such as a basic color camera. Several methods extract the position of an object by focusing on one kind of feature; for instance, scale-invariant feature transform (SIFT) [1] and speeded up robust features (SURF) [2] are early and classical features for pose estimation based on the texture of objects. There are also methods relying on geometric information for when the object is textureless.

Historically, pose estimation was first studied on synthetic, computer-generated images, from which research progressed to real, natural images. Synthetic images are often simpler and allow researchers to control properties such as texture and lighting.

A very common method to retrieve the pose is epipolar geometry. If we have two images of an object captured from different camera positions, there are geometric relationships between the 3D points and their projections onto the 2D image planes that allow us to estimate the pose of the object.
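As a rough illustration of this idea, the following sketch uses OpenCV to recover the relative rotation and translation between two views from matched keypoints via the essential matrix; the intrinsic matrix and the point correspondences are placeholder values, not data from this project.

import cv2
import numpy as np

# Placeholder camera intrinsics and matched 2D keypoints between two views.
# In practice the correspondences would come from a feature matcher (e.g. SIFT).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
pts1 = np.random.rand(50, 2).astype(np.float32) * 640  # keypoints in image 1
pts2 = np.random.rand(50, 2).astype(np.float32) * 640  # keypoints in image 2

# Estimate the essential matrix (RANSAC rejects outlier matches).
E, inlier_mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)

# Decompose it into a relative rotation R and a unit-scale translation t.
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
print("R =\n", R, "\nt =\n", t)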

Even so, when the object is occluded in the image, it is difficult to obtain these geometric relationships.

An approach that deals with occlusions, and even with clutter and illumination changes, is the one using similarity transformations in real time [3]. The matching is performed based on the maxima of a similarity measure in the transformation space. For normal applications, subpixel-accurate poses are obtained by extrapolating the maxima of the similarity measure from discrete samples in the transformation space. For applications with very high accuracy requirements, a least-squares adjustment is used to further refine the extracted pose.

We can also mention an approach that is robust for textureless objects.


The Fast 6D pose estimator [4] uses a coarse initialization and estimates the pose by using edge correspondences, where the similarity measure is encoded using a pre-computed linear regression matrix.

Also, Gradient Response Maps for Real-Time Detection of Textureless Objects [5] proposed a method including a novel image representation for template matching, designed to be robust to small image transformations. It uses the gradient orientation of the edges of objects to create templates.

3D-Information-Based Approaches

3D information is more convenient for measuring the 6D pose, since it preserves the original appearance of the object. It is therefore more robust and can produce more accurate results. In addition, thanks to the evolution of technology, there are more and more ways to acquire 3D scenes, such as depth cameras and 3D scanners.

We can differentiate between matching-based approaches and local descriptor approaches.

The aim of matching-based approaches is to search for the most similar template in the dataset and return the 6D pose of that template. One of the challenges we face with these kinds of methods is that raw 3D data is very large and therefore computationally expensive to process.

Consequently, many preprocessing methods have been proposed to reduce the complexity of the task:

• [6] Extracts keypoints from the template and preserves those carrying more information, such as edge points.

• [6] Uses a 2D/2.5D object detector for scene point clouds. YOLO is used to segment the scene point cloud with 2D bounding boxes due to its lower time consumption.

• [7] Combines PCOF-MOD (multimodal PCOF), a balanced pose tree (BPT), and optimum memory rearrangement in 6D pose estimation to optimize the data storage structure and lookup speed.

• [8] Proposes a novel multi-task template matching (MTTM) framework that finds the nearest template of a target object from an image, while predicting segmentation masks and a pose transformation between the template and the detected object in the scene using the same feature map of the object region.


On the other hand, local descriptor approaches define and calculate a global descriptor on the model offline. The global descriptor should be invariant with respect to rotation and translation. Then, the local descriptor is calculated and matched with the global descriptor online.

2.1.2 Learning-Based approaches

Given a large set of images of the object in different poses, learning-based systems learn the mapping from 2D image features to the pose transformation. Then, when a new image is given, the system is able to estimate the object pose. The performance of learning-based methods is good. Nevertheless, we have to take into account that they require plenty of training data and time to train the model, and that they will not produce good estimates for poses that are not represented in the dataset, since they have not learned from them.

We can classify the approaches into three categories: keypoint-based, holistic and RGB-D-based.

Keypoint-based approaches extract 2D-3D point pairs and then use PnP to calculate the 6D pose of an object. They are two-step approaches: we first extract the 2D feature points in the input image and then regress the 6D pose using a PnP algorithm.
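As a minimal sketch of this two-step idea (not code from the thesis), assume a network has already predicted the 2D image locations of a handful of known 3D model keypoints; the 6D pose can then be recovered with OpenCV's PnP solver. All coordinates and intrinsics below are made-up placeholder values.

import cv2
import numpy as np

# Step 1 (assumed already done): a network predicted the 2D projections of known
# 3D model keypoints. The values below are placeholders for illustration only.
points_3d = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1],
                      [1, 1, 0], [1, 0, 1]], dtype=np.float64)      # model keypoints
points_2d = np.array([[320, 240], [400, 238], [322, 180], [318, 300],
                      [402, 178], [398, 302]], dtype=np.float64)    # predicted projections
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])                                     # camera intrinsics

# Step 2: regress the 6D pose from the 2D-3D correspondences with PnP.
ok, rvec, tvec = cv2.solvePnP(points_3d, points_2d, K, None)        # None: no lens distortion
R, _ = cv2.Rodrigues(rvec)                                          # rotation vector -> matrix
print("rotation:\n", R, "\ntranslation:\n", tvec)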

From the keypoint-based approaches we can mention BB8 [9]. It uses a CNN to predict the 2D projections of the eight 3D bounding box corners of the object. Then, to deal with textureless symmetrical objects, it restricts the range of the rotation angle of the training data and uses a classifier to predict the rotation angle during the estimation step. One of the challenges it faces is when the object is partially invisible, in which case there are difficulties in retrieving the correct 3D bounding box.

To solve this, one approach has been to segment the image into patches, predict which object each patch belongs to along with the 2D projections, and then combine the patches estimated as belonging to the same object and compute the 6D pose with PnP. This method performs well even with occlusions, since each visible part of the object contributes a local pose prediction in the form of 2D keypoint locations. [10]

Another method that deals well with occlusions uses PVNet [11] to detect 2D keypoints in a RANSAC-like fashion, which robustly handles occluded and truncated objects. The RANSAC-based voting also gives a spatial probability distribution for each keypoint, allowing the estimation of the 6D pose with an uncertainty-driven PnP.

We can also mention a single-stage approach [12] that directly regresses 6D poses on the basis of groups of 3D-to-2D correspondences associated with each 3D object keypoint.

The holistic approach uses an end-to-end structure to measure the 6D pose, which is faster and more robust than keypoint-based approaches.

We can mention PoseNet [13], a robust real-time monocular six-degree-of-freedom relocalization system. With a 23-layer deep convnet, it manages to solve complicated image-plane regression problems with only a few dozen training examples. It is robust to difficult lighting, motion blur and different camera intrinsics where point-based SIFT registration fails.

We can also cite SSD [14] (Single Shot MultiBox Detector) and SSD-6D [15]. SSD was the first method to associate bounding box priors with feature maps of different spatial resolutions in the network, and it was able to detect objects in images using a single deep neural network. SSD-6D is an extension of the above-mentioned model that trains on synthetic model data only.

RGB-D-based learning approaches combine color information and depth information to achieve a more accurate and robust 6D pose result.

For instance, DenseFusion [16] is a two-stage approach. First, a heterogeneous network processes the RGB data and the point cloud data, preserving the original structure. In the second stage, a fully convolutional network maps each pixel in the RGB crop to a color feature space, and a network based on PointNet maps each point in the point cloud to a geometric feature space. The features from the two spaces are then merged and a 6D pose estimation result is output. Finally, the result is refined through iterative learning.

Also, G2L-Net [17] is a three-step real-time 6D object pose estimation framework. Firstly, it extracts the object point cloud from the RGB-D image by 2D detection. Then, 3D segmentation and translation prediction are performed by feeding the object point cloud to a translation localization network. Finally, with the predicted segmentation and translation, the point cloud is transferred into the local canonical coordinate frame in order to estimate the initial object rotation by training a rotation localization network.

In the following table we can see a comparison between the different approaches we have listed.

Figure 2.1: Table 1. Comparison between approaches

2.2 Deploying Deep Learning in Real-World Applications

Training deep learning systems is nowadays fairly straightforward using Python tools like PyTorch and open-source code repositories that are often released with research papers. However, this is not yet enough to integrate the technology into practical applications. For instance, interactive applications are often developed in other languages, and even if a deep learning library's API is available for other languages, the code might not compile for all platforms. Mobile augmented reality development, in particular, requires complex technology stacks, and developers often work with general-purpose interactive software platforms like the Unreal and Unity game engines. Below, we review what options there are for getting one's deep neural networks to run in such contexts.


2.2.1 Unreal

Unreal [18] is an advanced real-time 3D creation tool. It was originally developed as a state-of-the-art game engine, but it has become a platform where creators across industries have the freedom and control to deliver cutting-edge content, interactive experiences, and immersive virtual worlds.

For training and implementing machine learning algorithms in a project in Unreal, there is an Unreal Engine plugin for TensorFlow, tensorflow-ue4.

The plugin contains C++, Blueprint and Python scripts that encapsulate TensorFlow operations as an Actor Component.

However, there is an important limitation: a working build is currently available only for the Windows platform.

2.2.2 Unity

Unity is a game engine that allows creating games in both 2D and 3D. The platform supports nearly every device in existence. It has a C# scripting API, and one of the things that makes it appealing is the integration of tools for developing easily in virtual and augmented reality. This, added to the fact that the platform also incorporates resources to build neural networks, made it the perfect platform on which to develop the project.

ONNX - Open Neural Network Exchange

Unity can import trained models and apply inference with ONNX. ONNX is an open format built to represent machine learning models. [19] It defines a common set of operators - the building blocks of machine learning and deep learning models - and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers. [20] Essentially, it enables interoperability between different frameworks.

Using this tool we can import the segmentator model and the pose estimator model into Unity.

Unity Barracuda

In order to import the onnx models into Unity we use the Barracuda library.


Unity Barracuda [21] is a lightweight, cross-platform neural network inference library for Unity. Barracuda can run neural nets on both GPU and CPU. At the time of writing, Barracuda is in the preview development stage. Here are some projects that use Barracuda in Unity for object detection and classification with YOLOv2:

• YOLO-UnityBarracuda [22]

The project imports an already trained YOLOv2 model in onnx format and applies inference. It mainly targets mobile devices, using the camera as input.

As can be seen in figure 2.2, it detects and classifies the objects shown on the screen.

Figure 2.2: Object and human detection and classification in Unity

• TinyYOLOv2Barracuda [23]


It shows how to run the YOLOv2 object detection system on the Unity Barracuda neural network inference library. It uses a webcam or a UVC-compliant video capture device as input.

After detection, a pixelation effect is applied to the regions detected as persons.

Figure 2.3: Pixelation in the areas detected as humans

2.3 Augmented reality

More and more technology is becoming involved in our daily lives, easing many aspects of them. Since media content became massively available, augmented reality has gained increasing interest in numerous industries.

Augmented Reality [24] aims at simplifying the user's life by bringing virtual information not only into their immediate surroundings, but also into any indirect view of the real-world environment, such as a live video stream. AR enhances the user's perception of, and interaction with, the real world.


One clear example of Augmented Reality is the recent application that IKEA launched, which lets customers visualize in real time the different products on offer by projecting them onto the screen of the device, where they can see how the product would fit in the chosen position of the room.

Figure 2.4: IKEA's augmented reality application to showcase its products

As can be seen in figure 2.4, the background is the room that the device is recording, and the different armchairs are projected with augmented reality on the screen of the device.

The terms Augmented Reality and Virtual Reality are sometimes confused.

Virtual Reality [25] is the illusion of participation in a synthetic environment rather than external observation of such an environment. VR relies on a three-dimensional (3D), stereoscopic, multisensory experience.

It requires a particular collection of technological hardware, including computers, head-mounted displays, headphones and motion-sensing gloves or devices. [26]


Figure 2.5: Virtual Reality

As can be seen in figure 2.5, the man is interacting inside a simulated environment using VR glasses and motion sensors.

The AR part of the project would correspond to projecting, on the mobile phone screen, a human mannequin performing exercise suggestions on the detected gym structure.


Chapter 3

Methodology

3.1 Starting points: Detecting and tracking playground structures

The thesis is a continuation of the thesis advisor's work, who provided Lappset-Base [27], the implementation of a system that, given the Blender file of a gym structure (in this case a climbing rig), trains two neural networks for estimating the pixel mask and the 6D object pose.

3.2 Lappset-Base

The codebase consists of three blocks.

Generator: Generates simulated training images for training the segmentator and the 6D object pose estimator.

Segmentator: Trains a neural network for detecting object masks of climbing rigs, given source images and ground-truth object masks.

Pose estimator: Predicts the 6D object pose (translation and rotation) of a gym structure, given an image and the predicted object mask.

3.2.1 Generator

The aim of this block is to create sufficient data to be able to train the segmentator and the pose estimator. We use the Blender files of the gym structures and some background images as inputs. The generator then creates synthetic data by combining the gym structure and the background image and randomizing parameters such as the camera position, structure rotation, camera rotation, and the angle and intensity of the sun. As output we obtain as many images as requested, and a JSON file that contains the 6D object pose of each generated image.
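For illustration, the snippet below sketches the kind of per-image record such a generator could write to the pose JSON file; the field names and value ranges are assumptions for this example, not the actual Lappset-Base schema.

import json
import random

# Hypothetical per-image record: the real Lappset-Base JSON schema may differ.
def random_pose_entry(image_id):
    return {
        "image": f"img_{image_id:05d}.png",
        "rotation_euler": [random.uniform(-180, 180) for _ in range(3)],  # degrees
        "translation": [random.uniform(-5, 5),                            # camera-frame offset
                        random.uniform(-5, 5),
                        random.uniform(2, 15)],
        "sun_angle": random.uniform(0, 90),
        "sun_intensity": random.uniform(0.3, 1.5),
    }

entries = [random_pose_entry(i) for i in range(2176)]
with open("poses.json", "w") as f:
    json.dump(entries, f, indent=2)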

3.2.2 Segmentator

The aim of this block is to obtain the model that detects object masks of the gym structure with the best IoU (Intersection over Union) score.

With the images generated by the generator, separated into training and validation sets, and the ground-truth object masks, we train the segmentator and obtain the model that best retrieves the pixel mask of the gym structure.
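The IoU score itself is simple to compute for binary masks; the following sketch (generic code, not taken from the codebase) shows the metric used to rank the trained models.

import numpy as np

def mask_iou(pred, gt):
    """Intersection over Union of two boolean masks of equal size."""
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection) / float(union) if union > 0 else 1.0

# Toy 320x320 masks: a predicted rectangle slightly offset from the ground truth.
pred = np.zeros((320, 320), dtype=bool); pred[100:220, 80:200] = True
gt = np.zeros((320, 320), dtype=bool);   gt[110:230, 90:210] = True
print(f"IoU = {mask_iou(pred, gt):.3f}")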

3.2.3 Pose estimator

The aim of this block is to obtain the model that best estimates the 6D object pose of the gym structure. The inputs are the object masks, separated into training and validation sets, and the corresponding JSON files containing the ground truth of the 6D object pose.

After executing the three blocks, we obtain the best trained model from the segmentator and from the pose estimator.

The objective is to import these models into Unity3D in order to apply inference and be able to detect the gym structure and its pose.

We will now focus on the components involved in the execution of the system.

3.2.4 Triton cluster

Triton [28] is the Aalto high-performance computing cluster, with several thousand CPUs, 2 PB of storage capacity, an Infiniband network, and many GPUs for deep learning and artificial intelligence research.

These GPUs have been used for training the segmentator and the pose estimator.


3.2.5 Lappset-Base execution

In this section we will describe the main steps to run the system.

1. Add background images and a scene file with the Blender models of the gym structures to the project. In order to get images that resemble those of the real world, the background images have been extracted from a website that offers HDR images. High Dynamic Range Imaging (HDRI) is a technique used to reproduce a great range of luminosity, where every pixel of the image contributes to the lighting of the scene. When the generator is run, it generates images by combining the model from the Blender file with the background images, with different positions, luminosity, rotations, etc., as well as the corresponding mask images.

In addition, it generates a file containing the 6D poses of each image.

2. Generate the data (in our case, 2176 images).

(a) Generated real image (b) Generated mask image

3. Separate the data into 80 percent training (1740 images) and 20 percent validation (436 images); a minimal split sketch is given after this list.

4. Train the segmentator and obtain the model with the best IoU that retrieves the masks from the images. The number of epochs used to train the segmentator was 50.

5. Train the pose estimator and obtain the best model. The number of epochs used to train the pose estimator was also 50.
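A minimal sketch of the split in step 3, assuming the generated images live in a single folder (the paths and file layout are assumptions, not the actual project structure):

import random
from pathlib import Path

random.seed(0)
images = sorted(Path("generated/images").glob("*.png"))  # assumed output folder
random.shuffle(images)
split = int(0.8 * len(images))                            # e.g. 1740 of 2176 images
train, val = images[:split], images[split:]
Path("train.txt").write_text("\n".join(str(p) for p in train))
Path("val.txt").write_text("\n".join(str(p) for p in val))
print(f"{len(train)} training / {len(val)} validation images")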


3.3 Feasibility study: Unity Barracuda and Mobile Augmented Reality

After running Lappset-Base, we have obtained two pth files corresponding to the trained models of the segmentator and the pose estimator.

The idea is to convert these pth files into onnx format in order to import them into Unity and thus apply inference for unseen images within the platform.

In order to understand onnx, we start with a simple binary classification model. The model predicts the class to which a point belongs, given a scatter plot coordinate. The general idea of the procedure is as follows [29]:

1. We generate two clusters with some overlap.

Figure 3.2: Scatter plot of the 2 clusters to classify

2. We build the model and train it.

3. We export the model in onnx format as follows:

torch.onnx.export(model,              # model being run
                  x[0].float(),       # model's input
                  "onnx_model.onnx",  # model name
                  opset_version=11,   # the ONNX version
                  input_names=['x'],  # the model's input names
                  output_names=['y']) # the model's output names

4. Go to Unity and install the Barracuda package, which allows Unity to understand and handle onnx models.

5. Create an empty object and a C# script attached to it.

6. In the script we import the model and describe the inputs and outputs. The basic script is the following:

(a) First we create an NNModel object.

(b) Then, we create a model variable where we load the NNModel object, and a worker variable.

var model = ModelLoader.Load(nnModel);
var worker = WorkerFactory.CreateWorker(WorkerFactory.Type.ComputePrecompiled, model);

The worker is in charge of breaking down the model into executable tasks and scheduling them on the GPU or CPU.

(c) Then we define the inputs and the outputs. In this case, the inputs are the coordinates of the points whose class we want to know. For instance, the definition for the coordinates [0,0] would be:

var inputTensor = new Tensor(1, 2, new float[2] { 0, 0 });

worker.Execute(inputTensor);

The output is the class to which the algorithm assigns the position inserted as input. The definition of the output would be:

var output = worker.PeekOutput();

Once the script is saved, the NNModel object we have created will appear in Unity and we can supply the onnx model to it.

When we press run in Unity, we get the expected results. To give a couple of examples, when we insert [0,0] as input we get 1 as output, which corresponds to the yellow cluster, and when we insert [1,1] as input we get 2 as output, the purple cluster.
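As an optional sanity check outside Unity (not part of the original workflow), the exported onnx file can also be run with ONNX Runtime in Python to confirm that it produces the same classes for the same inputs; the input shape is assumed to match the x[0] sample used in the export.

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("onnx_model.onnx")
for point in ([0.0, 0.0], [1.0, 1.0]):
    x = np.array(point, dtype=np.float32)      # same shape as the export input x[0]
    outputs = session.run(None, {"x": x})      # input name 'x' as declared in the export
    print(point, "->", outputs[0])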


3.3.1 Unity Barracuda with Lappset-Base

Now that we have a general idea of how Unity Barracuda and onnx models work, let us do the same with the Lappset-Base models.

In this case, the model is much more complex. The segmentator model has been created with a pre-trained encoder. In addition, instead of having an x and y coordinate value as input, we have a color image (3 channels) of size 320x320 as input, and the mask of the climbing rig as output.
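For reference, converting the segmentator to onnx follows the same pattern as the classifier export above; a hedged sketch is shown below, assuming the whole model object was pickled into the pth file (if only a state_dict was saved, the network class must be instantiated first), and with assumed input/output names.

import torch

# Assumption: segmentator.pth contains the pickled model object itself.
model = torch.load("segmentator.pth", map_location="cpu")
model.eval()
dummy_input = torch.randn(1, 3, 320, 320)      # one 3-channel 320x320 image
torch.onnx.export(model,
                  dummy_input,
                  "segmentator.onnx",
                  opset_version=11,
                  input_names=["image"],
                  output_names=["mask"])       # output name is an assumption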

However, when trying to execute it, Unity detects the neural layers but is not able to process them, as the methods for doing so are defined but not yet implemented, and are therefore empty.

After investigating the reason for this (other projects had been seen to run complex neural networks without any issues), it was observed that Unity Barracuda has a specific implementation ready for YOLOv2; therefore, if the pre-trained model used is that one, there is no problem. However, we have seen that Unity Barracuda does not currently offer a robust system for running and training independently designed models.
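One way to diagnose which layers cause the problem (a suggestion, not the procedure followed in the thesis) is to list the ONNX operator types the exported model actually uses and compare them against the operators Barracuda implements:

import onnx
from collections import Counter

model = onnx.load("segmentator.onnx")          # file name as in the sketch above
onnx.checker.check_model(model)                # confirms the file itself is valid ONNX
ops = Counter(node.op_type for node in model.graph.node)
for op, count in ops.most_common():
    print(f"{op:20s} x{count}")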

We undertook this project with the hypothesis that Unity's Barracuda system provides tools for cross-platform deep learning, without cumbersome compiling of various machine learning tools for all operating systems and hardware configurations. However, deep learning technology platforms are still evolving and not completely mature. This fact made us rethink the project, since the execution of the models on the platform is a fundamental part of the process.

We changed the course of the project, and our objective was set to improving the performance of the pose estimator: looking for an implementation to adapt and comparing its results with those of the one we already have.

The process is explained in detail in section 3.4, Pose estimator analysis.

3.3.2 Augmented Reality in Unity

In this section we will describe the basic concepts of augmented reality in Unity by creating a straightforward AR project that projects an image or figure on a specified target to be detected.

For this part of the project, Unity needs to have "Vuforia Unity augmented reality support" installed. We can install Vuforia either at the time of installation of the Unity program, by selecting that we want this package, or once Unity is already installed, using the package manager. In the second case, go to the Window option in the project, select Package Manager, look for the Vuforia Engine AR package and install it. If this procedure is followed, Vuforia will be installed in this project only. If you create another project, follow the same procedure and use the package manager to add Vuforia.

Vuforia is an augmented reality SDK (software development kit) [30] for mobile devices, which provides many components for an AR application: AR recognition, AR tracking and AR content rendering.

The Vuforia SDK supports both 2D and 3D targets. It provides features like localized occlusion detection using virtual buttons and image target selection in real time, and it can reconfigure and create target sets depending on the scenario. Furthermore, it supports native development for both iOS and Android, while enabling the development of AR applications in Unity that are easily portable to both platforms.

After the installation, we shall ensure that our software has a Vuforia license key. [31]

To do this, go to the Vuforia website (https://developer.vuforia.com/), log in (or register if it is the first time) and go to the license keys section inside the Develop tab. Create a new key by clicking Get Development Key and then decide on a License Name. Once created, click on it and the license key will be displayed.

We then create a new 3D project in Unity. We install Vuforia with the Package Manager if needed, and then go to Window/Vuforia Configuration. The Inspector window will open up and we will be able to add the license key in the corresponding field by copying and pasting it from the Vuforia website into Unity.

Figure 3.4: License key insertion

Once everything is set up, we add an AR camera to the project (GameObject/Vuforia Engine/AR Camera), which will be the viewpoint through which we see all the assets of the game, and delete the default camera that the project imported.

The next thing to add is the Image Target (GameObject/Vuforia Engine/Image), which is the image that is going to be recognised in the real world and where we will add some AR content.

When it has been added, we will be able to see the default Unity image, but if we want a specific image, the procedure for modifying and adapting it is detailed below. When you select the Image Target, a section called "Image Target Behaviours" will appear in its Inspector window. In this section click "Add target", and this will take you back to the Vuforia website.

On the web, we add a database in the Target Manager tab, where we will add a target. We can choose the type of target from Single Image, Cuboid, Cylinder or 3D Object. For the purpose of this example, Single Image is enough, but for our project we would use the 3D Object option, which would correspond to the climbing rig detected by the system, onto which the figure making exercise suggestions would be projected.

We choose the image that will be the target and download the database to import it into Unity.

Now that we have defined the image target, we can add images or figures that will be projected with augmented reality, for example a cube on top of the image target.

The procedure is simple: just create a 3D cube-shaped object linked to the image target and place it in the desired position with respect to the image target.

Figure 3.5: AR testing. We can see on the computer screen that the cube is projected on top of the image target.


3.4 Pose estimator analysis

In this section we describe the search for pose estimators with available implementations and the process of adapting one of them to our project, in order to run it, compare results, and propose an improvement over the existing estimator.

The chosen implementation is Real-Time Seamless Single Shot 6D Object Pose Prediction [32].

It is a real-time 6D pose estimator that proposes a single-shot approach for simultaneously detecting an object in an RGB image and predicting its 6D pose. The CNN architecture is described in the figure below.

Figure 3.6: Overview: (a) The proposed CNN architecture. (b) An example input image with four objects. (c) The S×S grid showing cells responsible for detecting the four objects. (d) Each cell predicts the 2D locations of the corners of the projected 3D bounding boxes in the image. (e) The 3D output tensor from the network, which represents, for each cell, a vector consisting of the 2D corner locations, the class probabilities and a confidence value associated with the prediction.

The network is designed to predict, using a YOLO-like architecture, the 2D projections of the corners of the 3D bounding box around our object in the image. Then, given these 2D coordinates and the 3D ground control points for the bounding box corners, the 6D pose can be calculated algebraically with an efficient PnP algorithm.

3.4.1 Adaptation process

Adapting the implementation of the paper to take the outputs of our system as inputs directly was not feasible. Thus, what has been done is to generate the data with our system and adapt it so that it can be fed to the implementation for training.

The implementation takes an array of 21 values as label input. The first value indicates the class, which in our case will always be 1, since for the moment we only have one gym structure to detect. The following 2 values correspond to the x and y image coordinates of the projected centroid. The following 16 values correspond to the x and y image coordinates of the eight corners of the object's 3D bounding box. Finally, the remaining 2 values correspond to the width and height of the 2D rectangle computed by projecting the 8 corners of the 3D bounding box into 2D.

Our implementation outputs the rotation (in Euler angles) and translation information.

In order to retrieve the 8 points of the bounding box after the transformation, we first take the original bounding box and obtain the transformed point values as follows (a minimal sketch of these steps is given after the list):

1. Define the rotation matrix R using the Euler angles:

R = \begin{pmatrix}
\cos\alpha\cos\beta & \cos\alpha\sin\beta\sin\gamma - \sin\alpha\cos\gamma & \cos\alpha\sin\beta\cos\gamma + \sin\alpha\sin\gamma \\
\sin\alpha\cos\beta & \sin\alpha\sin\beta\sin\gamma + \cos\alpha\cos\gamma & \sin\alpha\sin\beta\cos\gamma - \cos\alpha\sin\gamma \\
-\sin\beta & \cos\beta\sin\gamma & \cos\beta\cos\gamma
\end{pmatrix}

2. Multiply each point by the rotation matrix to obtain the rotated positions.

3. Apply translations.
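A minimal sketch of these three steps (the angle convention follows the matrix above; the box half-extents and pose values are placeholders for illustration):

import numpy as np

def rotation_matrix(alpha, beta, gamma):
    """Rotation matrix built from the Euler angles as in the expression above."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    return np.array([
        [ca * cb, ca * sb * sg - sa * cg, ca * sb * cg + sa * sg],
        [sa * cb, sa * sb * sg + ca * cg, sa * sb * cg - ca * sg],
        [-sb,     cb * sg,                cb * cg],
    ])

# Eight corners of an axis-aligned box centred at the origin (half-extents hx, hy, hz).
hx, hy, hz = 1.0, 0.5, 1.2
corners = np.array([[sx * hx, sy * hy, sz * hz]
                    for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])

R = rotation_matrix(np.radians(30), np.radians(-10), np.radians(45))
t = np.array([0.2, 0.0, 4.0])            # translation predicted by the estimator (placeholder)
transformed = corners @ R.T + t          # step 2: rotate each corner; step 3: translate
print(transformed)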


Chapter 4

Results

4.1 Lappset-Base pose estimator

In this section we present a series of experiments measuring the performance of the pose estimator for different numbers of input images and epochs.

Because the data must first be generated with the generator and then adapted as input for the pose estimator, and because the models are trained on the Triton cluster, it has not been possible to produce the usual plots analyzing performance over the full range of numbers of images or epochs.

Figure 4.1: Table with loss and MSE values for different numbers of images and epochs


4.2 Pose estimator adaptation

Although the files have been created as requested by the implementation, and the data has been generated in the format used in their project, internal errors in the code keep appearing and are still difficult to trace. Because of this, it has not been possible to compare its performance with that of our own project.


Chapter 5

Conclusions

Although the goal was not reached, research has been carried out in many areas and the knowledge gained from it has been vast. It has been a project where many blocks were worked on at the same time, so, for instance, even though we discovered the impossibility of deploying deep learning in Unity, we have also been able to work with augmented reality applications.

Throughout the project, the execution flow of a climbing rig detector and pose estimator has been studied. Even though the base code was provided by the supervisor, it has been a challenge to understand it and be able to execute it. Following this, it has been discovered that Unity currently lacks support for deep learning development inside the platform. For future Unity versions, when the currently empty methods get built and it becomes possible to run the project in Unity, it will definitely be an interesting project to come back to.

Knowledge has been gained about augmented reality applications in Unity, and some functional projects have been created. For the augmented reality part, Unity has a lot of documentation and tools that allow creating really interesting content. This part of the project has definitely been of interest, and it has been a discovery to realise how accessible these kinds of game engines are.

Finally, in-depth research has been done on pose estimators. The original goal was not reached, but the experiments and the literature review should still be useful for others continuing this work.


Bibliography

[1] G Lowe. "Sift-the scale invariant feature transform". In: Int. J 2.91-110 (2004), p. 2.

[2] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. "Surf: Speeded up robust features". In: European conference on computer vision. Springer. 2006, pp. 404–417.

[3] Carsten Steger. "Occlusion, clutter, and illumination invariant object recognition". In: International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences 34.3/A (2002), pp. 345–350.

[4] E. Munoz et al. "Fast 6D pose estimation for texture-less objects from a single RGB image". In: 2016 IEEE International Conference on Robotics and Automation (ICRA). 2016, pp. 5623–5630. doi: 10.1109/ICRA.2016.7487781.

[5] Stefan Hinterstoisser et al. "Gradient Response Maps for Real-Time Detection of Textureless Objects". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 34.5 (2012), pp. 876–888. doi: 10.1109/TPAMI.2011.206.

[6] Yan Zhang et al. "6D Object Pose Estimation Algorithm Using Preprocessing of Segmentation and Keypoint Extraction". In: 2020 IEEE International Instrumentation and Measurement Technology Conference (I2MTC). 2020, pp. 1–6. doi: 10.1109/I2MTC43012.2020.9128980.

[7] Yoshinori Konishi, Kosuke Hattori, and Manabu Hashimoto. "Real-Time 6D Object Pose Estimation on CPU". In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2019, pp. 3451–3458. doi: 10.1109/IROS40897.2019.8967967.

[8] Kiru Park et al. "Multi-Task Template Matching for Object Detection, Segmentation and Pose Estimation Using Depth Images". In: 2019 International Conference on Robotics and Automation (ICRA). 2019, pp. 7207–7213. doi: 10.1109/ICRA.2019.8794448.

[9] Mahdi Rad and Vincent Lepetit. "BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects Without Using Depth". In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). Oct. 2017.

[10] Yinlin Hu et al. "Segmentation-Driven 6D Object Pose Estimation". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 2019.

[11] Sida Peng et al. "PVNet: Pixel-Wise Voting Network for 6DoF Pose Estimation". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 2019.

[12] Yinlin Hu et al. "Single-Stage 6D Object Pose Estimation". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 2020.

[13] Alex Kendall, Matthew Grimes, and Roberto Cipolla. "PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization". In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). Dec. 2015.

[14] Wei Liu et al. "Ssd: Single shot multibox detector". In: European conference on computer vision. Springer. 2016, pp. 21–37.

[15] Wadim Kehl et al. "SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again". In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). Oct. 2017.

[16] Chen Wang et al. "DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 2019.

[17] Wei Chen et al. "G2L-Net: Global to Local Network for Real-Time 6D Pose Estimation With Embedding Vector Features". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). June 2020.

[18] Epic Games, Inc. Unreal Engine. 2020. url: https://www.unrealengine.com/en-US/.

[19] The Linux Foundation. ONNX. 2019. url: https://onnx.ai/index.html.

[20] Microsoft Corporation. What is ONNX? 2020. url: https://microsoft.github.io/ai-at-edge/docs/onnx/.

[21] Unity Technologies. Unity Barracuda. 2020. url: https://docs.unity3d.com/Packages/[email protected]/manual/index.html.

[22] wojciechp6. YOLO-UnityBarracuda. 2020. url: https://github.com/wojciechp6/YOLO-UnityBarracuda.

[23] Keijiro Takahashi. TinyYOLOv2Barracuda. 2021. url: https://github.com/keijiro/TinyYOLOv2Barracuda.

[24] Julie Carmigniani and Borko Furht. "Augmented reality: an overview". In: Handbook of augmented reality (2011), pp. 3–46.

[25] Rae A Earnshaw. Virtual reality systems. Academic Press, 2014.

[26] Jonathan Steuer. "Defining virtual reality: Dimensions determining telepresence". In: Journal of Communication 42.4 (1992), pp. 73–93.

[27] Tuure Saloheimo. Lappset-Base. 2020. url: https://github.com/ThetaNord/Lappset-Base.

[28] Aalto Science-IT. Triton cluster. 2020. url: https://scicomp.aalto.fi/triton/.

[29] Andre Pereira. How to use Pytorch models in Unity. 2020. url: https://medium.com/@a.abelhopereira/how-to-use-pytorch-models-in-unity-aa1e964d3374.

[30] Dhiraj Amin and Sharvari Govilkar. "Comparative study of augmented reality SDKs". In: International Journal on Computational Science & Applications 5.1 (2015), pp. 11–26.

[31] Jonathan Linowes and Krystian Babilinski. Augmented reality for developers: Build practical augmented reality applications with Unity, ARCore, ARKit, and Vuforia. Packt Publishing Ltd, 2017.

[32] Bugra Tekin, Sudipta N. Sinha, and Pascal Fua. "Real-Time Seamless Single Shot 6D Object Pose Prediction". In: CVPR. 2018.