Arya Ghodsi

Enhancing user experience by data processing using wearable computing devices

Master's dissertation submitted in order to obtain the academic degree of Master of Science in de ingenieurswetenschappen: computerwetenschappen

Academic year 2015-2016
Faculty of Engineering and Architecture
Department of Electronics and Information Systems
Chair: Prof. dr. ir. Rik Van de Walle

Supervisors: Prof. dr. Peter Lambert, Prof. dr. ir. Rik Van de Walle
Counsellors: Ir. Jonas El Sayeh Khalil, Ignace Saenen

Abstract

Enhancing user experience by data processing using wearable computing devices

by Arya Ghodsi
Master's dissertation submitted in order to obtain the academic degree of Master of Science in de ingenieurswetenschappen: computerwetenschappen
Academic year 2015-2016
Ghent University, Faculty of Engineering and Architecture, Department of Electronics and Information Systems
Chairman: Prof. dr. ir. Rik Van de Walle
Supervisors: Prof. dr. Peter Lambert, Prof. dr. ir. Rik Van de Walle
Counsellors: Ir. Jonas El Sayeh Khalil, Ignace Saenen

Summary: With mobile and wearable devices gaining market share day by day, one can wonder about the possibilities of these devices. This dissertation investigates whether these kinds of devices are capable of enhancing the user experience using data processing. We use one of these devices in a practical use case in which we show additional media corresponding to the pages the user is reading. There are, however, some elements that may impose difficulties on the process.

Keywords: wearable devices, natural language processing, OCR


Enhancing user experience by data processing using wearable computing devices

Arya Ghodsi

Supervisor(s): Prof. dr. Peter Lambert, Prof. dr. ir. Rik Van de Walle, Ir. Jonas El Sayeh Khalil, Ignace Saenen

Abstract—Since mobile and wearable devices are gaining market share day by day, one can wonder about new application areas for these devices. This dissertation investigates whether these kinds of devices are capable of enhancing the user experience using data processing. We use one of these devices in a practical use case in which we show additional media corresponding to the pages the user is reading. There are, however, some elements that impose difficulties on the process. We set up a use case which allows us to test whether wearable devices are able to handle the data processing and enhance the user experience.

Keywords—wearable devices, natural language processing, OCR

I. INTRODUCTION

Since wearable computers are becoming more prominent in everyday life, it must be considered how these devices can be used to enhance everyday life and how they can become more than just a gadget. These devices have a lot of potential based on how people use them: users do not have to reach for their pockets or purses, and with one swift gesture they can receive a lot of information at once. Within the category of wearable computers we can distinguish several subcategories, such as smart watches, wearable devices which are placed on the head of the user, and wearable devices that are designed as activity trackers. This dissertation will focus on devices such as Google Glass¹, the ReconJET glasses² and the Epson Moverio³.

In this dissertation we test the Epson Moverio to see whether it is suitable for showing additional media to enhance the user's experience. We have selected a use case of reading books, in which the user is presented with additional media content based on the book he is reading.

For this purpose we have selected several media channels, which we will discuss later on. In order to know which media to show, one first needs to know which book the user is reading. Assuming that we have no prior knowledge of this, the only possible solution is to try to deduce this information from pictures of the pages in the book. Indeed, we need to translate the images taken with our device into machine-readable text. This can be accomplished by using OCR. Once this has been done, we need to understand the text we have obtained from the OCR. There are multiple ways to analyse text using natural language processing. We have performed several tests to select the optimal settings while checking whether the aforementioned device is capable of performing the tasks.

¹http://www.google.es/glass/start/
²http://www.reconinstruments.com/products/jet/
³http://www.epson.com/cgi-bin/Store/jsp/Landing/moverio_developer-program.do

We have developed a prototype client application which is capable of showing media to the user, while the computationally intensive work is performed by a web server. All these media are based on the context, i.e. the page the user is reading. In the next section we give an overview of related work. In Section III we explain the setup and highlight some of the implementation details. In Section IV our experiments and their results are described. Lastly, we point out some issues for future work in Section V and conclude in Section VI.

II. RELATED WORK

This section highlights some of the existing literature concerning the three main areas of this dissertation. We start by describing how current OCR software works. Next, we discuss semantic reasoning. Lastly, we describe literature dealing with the small display resolution of the category of devices we are investigating in this dissertation.

A. OCR

Every major OCR technology on the market right now does roughly the same thing. Given an image as input, the OCR software first performs some preprocessing, after which the layout is analysed. Once the layout has been analysed, the software tries to recognize text lines and words (in the image). In the next step, in which the software tries to recognize words (in linguistic terms), multiple approaches are possible. In the last step, machine-understandable text is output. An overview of these steps can be found in Figure 1.

Fig. 1. Usual flow of OCR software.
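To make this flow concrete, the following minimal Python sketch (not part of the dissertation's prototype) shows how a page image could be preprocessed and passed to an OCR engine; it assumes Tesseract is installed together with the pytesseract and Pillow packages.

```python
from PIL import Image
import pytesseract

def ocr_page(path):
    """Run a basic OCR pass over one page image."""
    img = Image.open(path)
    # Preprocessing: convert to greyscale and binarize with a fixed
    # threshold; real pipelines use adaptive binarization and deskewing.
    grey = img.convert("L")
    binary = grey.point(lambda p: 255 if p > 140 else 0)
    # Layout analysis, line/word recognition and text output are
    # handled internally by the engine.
    return pytesseract.image_to_string(binary, lang="eng")

if __name__ == "__main__":
    print(ocr_page("page.png"))
```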

We will discuss two major OCR technologies: Tesseract and OCRopy (formerly known as OCRopus).

A.1 Tesseract

Tesseract is an open-source OCR engine. It was developed by HP between 1984 and 1994. Tesseract had a significant lead in accuracy over the commercial engines, but in 1994 its development stopped completely. In 2005, however, HP made Tesseract open-source and its development has since been continued by Ray Smith (from Google Inc.)⁴ [1].

⁴The project is now available at https://code.google.com/p/tesseract-ocr/

Tesseract uses a traditional step-by-step pipeline for processing; some of these steps were unusual in the days in which Tesseract flourished. Due to its ties with HP, however, Tesseract never had its own page layout analysis and relied on HP's proprietary page layout analysis. In the first step the outlines of the components are stored using connected component analysis. Although this step was computationally expensive, it had an advantage: by inspecting the nesting of outlines and the number of child and grandchild outlines, it was easier to detect inverse text. At the end of this stage outlines are combined by nesting, creating Blobs. In the second stage Blobs are organized into text lines, the lines and regions are analysed, and the lines are broken into words. After this step a two-pass process follows for recognition. In the first pass the engine tries to recognize each word, passing each recognized word to an adaptive classifier as training data. This adaptive classifier then tries to recognize words with a higher accuracy. Given that the classifier may have learned useful information just a little too late, a second pass is performed over the page. Finally, the engine tries to resolve fuzzy spaces and checks alternative hypotheses for the x-height in order to locate small-cap text. This architecture can be found in Figure 2 [1].

Fig. 2. Architecture of the open-source OCR engine Tesseract.

A.2 OCRopy (formerly OCRopus)

OCRopus puts emphasis on modularity, easy extensibility and reuse. In [2] the author discusses the main properties of the OCR software. The architecture of OCRopus is almost the same as the one in Figure 1. It is built in such a way that there is no backtracking: the software moves strictly forward and consists of three major steps:

(Physical) layout analysis: in this step the software tries to identify text columns, blocks, lines and the reading order.
Text line recognition: this step is responsible for recognizing the text in a given line. OCRopus is designed in such a way that it is able to recognize vertical and right-to-left text. Furthermore, a hypothesis graph is constructed showing possible alternative recognitions.
Statistical language modelling: the last step consists of integrating the alternative recognitions from the previous step using prior knowledge about the language, domain, vocabulary, etc.

Both OCR technologies are used in our experiments to determine which is best suited to our case.

B. Semantic Reasoning

This subject can also be seen as part of the Natural Language Processing (NLP) area, which has long been a topic of scientific research. In NLP one tries to analyse text and achieve human-like language processing using machines. This is done using a range of (linguistic) theories as well as computational techniques. The term human-like language processing means that a machine is able to do the following [3], [4]:

1. Summarize a given text.
2. Translate from one language to another.
3. Answer questions concerning the context of a given text.
4. Draw conclusions from a given text.

Although we are not interested in all of the aforementioned items, numbers 3 and 4 are of interest for our research. One could try to deduce information from a given text manually; for this, Linked Data (and the Semantic Web) can be a great help. Linked Data, simply put, is a layer on top of the normal World Wide Web as we know it. It allows typed links to be created between data from different sources. In contrast to Web 2.0 mash-ups, which use fixed data sources, Linked Data operates on a global, unbounded data space [5], [6].

The Semantic Web is built upon principles that Berners-Lee gave as guidelines in [6].

The Linking Open Data Project⁵ is a grassroots community which tries to bootstrap the Web of Data by identifying already existing data sets which are available under open licenses and converting them to RDF according to the above-mentioned principles [5]. Figure 3 shows an impressive collection of data sets. According to statistics gathered by this project, the Web of Data consists of 19.5 billion RDF triples.

⁵http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

Fig. 3. Linking Open Data cloud showing the data sets that have already been published and interlinked. There are 570 data sets interconnected by 2909 link sets. Each arc represents a connection between two data sets; the darker the arc, the more links it has to other data sets.

As can be seen in Figure 3, DBpedia is the largest data set and contributes more than 3 billion RDF triples (17% of the total Web of Data).
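As an illustration of how such a data set can be queried (this is not part of the dissertation's implementation), the sketch below retrieves the English abstract of a resource from DBpedia's public SPARQL endpoint, assuming the SPARQLWrapper package is available.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://dbpedia.org/sparql")
endpoint.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?abstract WHERE {
        dbr:Gandalf dbo:abstract ?abstract .
        FILTER (lang(?abstract) = "en")
    }
""")
endpoint.setReturnFormat(JSON)

# Each binding holds one matching abstract string.
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["abstract"]["value"])
```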

However, one could also use existing NLP services which utilise sources such as DBpedia. We briefly describe these tools below.

B.1 AlchemyAPI

AlchemyAPI offers a wide range of text analysis functions, which we briefly discuss here. Once a document, in the form of raw text or a website, is fed to AlchemyAPI, it will try to do the following [7]:
Language identification: AlchemyAPI is able to recognize over 100 languages.
Keyword extraction: AlchemyAPI tries to retrieve the words from the document that contribute to the topic of the text.
Concept extraction: concept extraction is very much like keyword extraction, but concepts do not have to be mentioned explicitly. For example, if a given text speaks about the two wizards Harry and Voldemort, keyword extraction will identify them, but concept extraction will identify the concept magic instead.
Entity extraction: in this module AlchemyAPI explicitly looks for entities that are defined in Linked Data sets such as DBpedia, YAGO or Freebase. This is a very important feature and of particular interest for this dissertation.

B.2 OpenCalais

Another tool that offers the same variety of functions is OpenCalais. OpenCalais, however, has some shortcomings. At the moment it is only able to process three languages: English, French and Spanish. For the supported languages OpenCalais will try to identify entities, events and facts [8]. For example, consider the following (fictional) excerpt:
“Gandalf was appointed as head of the fellowship of the ring. He would first visit Saruman in Isengard, Middle-earth to seek advice.”
Within this excerpt OpenCalais should be able to deduce entities (e.g. Gandalf), events (e.g. Gandalf being appointed head of the fellowship) and facts.

C. Screen Real Estate

There has been a lot of (ongoing) research on how to manage the screen real estate of a mobile device. An issue for both mobile devices and wearables is the amount of information that has to be displayed given the little space that is available. The way an application handles this problem affects the perception of the user and the way he is able to handle complex tasks on small-display devices [9].

One set of solutions to this problem consists of design principles: guidelines on how to design user front-ends that show as much information as possible without losing the general overview [10], [11]. Another set of solutions is based on sonically enhanced systems, in which sounds are used to deliver information to the user and save space, e.g. sound notifications instead of visual icons [12]. Interface manipulation is another way to address this issue: the user is presented with a lot of information and is able to navigate through it using techniques such as fish-eye views, zooming and panning [13].

III. SETUP AND IMPLEMENTATION

We have built a web service in Python using Flask and Apache. This web service takes on the task of translating the images sent by the client into text using the preferred OCR engine. Once this has been done, the result is sent to one of the NLP services. The entities that are extracted from the text are then used to look up media. We have selected four different media channels:
Lineage information: in the use case of our prototype (The Lord of the Rings, LOTR) there are at least 982 characters. This can have unwanted consequences when the user cannot remember the characters and how they relate to the story. To remedy this, we can show lineage information for the character entities we are able to recognise.
Map information: another, highly interactive media channel is the Hobbit project⁶. This project uses WebGL and HTML5 capabilities to create an interactive map on which the user can explore the locations of the LOTR world, as well as the journeys of characters, in more depth.
Images: the third channel we use is GIF images. Many GIF services are available; we chose Giphy for its simple API⁷. It allows us to query the service for the entities we have recognised. Besides simplicity, Giphy also has the added value of exposing a lot of useful information about the GIFs that other services do not offer.
Tweets: lastly, we query Twitter for tweets on the character or location entities we have recognised. Twitter not only shows other users' experiences, but tweets can also include embedded media.
Two of these channels are hard-coded and only usable in this context, whereas the GIF images and tweets are not. It is a challenge to find suitable media without being context-specific.

⁶http://middle-earth.thehobbit.com/
⁷http://giphy.com and https://github.com/Giphy/GiphyAPI
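A minimal sketch of how such a web service endpoint could be structured is shown below. The route name and the ocr, extract_entities and lookup_media helpers are placeholders for illustration and do not reflect the dissertation's actual code.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def ocr(image_file):
    # Placeholder: in the prototype this is OCRopy or Tesseract.
    return "Gandalf rode towards Isengard."

def extract_entities(text):
    # Placeholder: in the prototype this is AlchemyAPI or OpenCalais.
    return ["Gandalf", "Isengard"]

def lookup_media(entity):
    # Placeholder: in the prototype this gathers lineage data, map
    # information, GIFs and tweets for the given entity.
    return {"gifs": [], "tweets": [], "lineage": None, "map": None}

@app.route("/process", methods=["POST"])
def process_page():
    """Accept a page image, run OCR and NLP, and return media per entity."""
    image = request.files["image"]
    text = ocr(image)
    entities = extract_entities(text)
    return jsonify({e: lookup_media(e) for e in entities})

if __name__ == "__main__":
    app.run()
```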

The result is then sent back to the client. The client is an Android application consisting of several activities. One of them is a GIF gallery, in which GIF images for all the recognised entities are shown. Another activity is the entity feed: a feed in which all the available information (images, tweets, lineage and map information) for one specific entity is shown. The flow of this application is visualised in Figure 4. The entity feed, the GIF gallery and the way of notification also tackle the screen real-estate challenge. Using a feed allows us to pack as much heterogeneous information as possible into one location. The same analogy can be made for the GIF gallery. Furthermore, by enabling the application to run in portrait mode as well as in landscape mode, the user can choose to see more information on the screen. We have also enabled the user to use different techniques when the information is densely packed, e.g. the ability to zoom and pan when viewing the images.

Fig. 4. The possible flow of the Android client application.

IV. EXPERIMENTS AND RESULTS

As mentioned in the previous sections, the user is able to choose from two OCR technologies and two NLP services. We experimented with these to determine a default choice.

A. OCR

The first flaw was exposed when trying to test the OCR software: the resolution of the camera on the wearable device (0.3 MP) was too low for OCR to be performed. For this reason, we performed the tests with a smartphone. To simulate most wearable devices on the market, we downscaled the camera resolution to 5 MP. The results are shown (partly) in Figures 5 and 6.

Fig. 5. The average performance of OCR software depending on the device the pictures were taken with.

Fig. 6. The average performance of OCRopy vs. Tesseract.

From the plot in Figure 5 it is clear that, with an average accuracy of 1.4%, the camera of the Epson Moverio BT-200 is too weak to be used in an environment where one needs to perform OCR on the images taken with the device. The images simulating wearables equipped with better imaging sensors, however, do seem well suited for the job, with an average accuracy of 50.2%. If we look beyond the imaging devices, Figure 6 shows that OCRopy outperforms Tesseract out of the box: OCRopy achieves an average accuracy of 40.4% versus the mediocre 11.2% of Tesseract. We believe this is due to the internal workings of both engines, and more specifically to the fact that OCRopy does a better job of correcting the skew of the images.
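The dissertation does not spell out the exact accuracy metric here; one common way to score OCR output against a ground-truth transcription is a character-level similarity ratio, sketched below with the Python standard library.

```python
from difflib import SequenceMatcher

def character_accuracy(ocr_text, truth_text):
    """Return a 0-100% character-level similarity between OCR output
    and the ground-truth transcription of the same page."""
    ratio = SequenceMatcher(None, ocr_text, truth_text).ratio()
    return 100.0 * ratio

# Example: compare an OCR result with the ground truth of the same line.
print(character_accuracy("Gandalf rode tovvards lsengard",
                         "Gandalf rode towards Isengard"))
```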

B. NLP

We also performed tests to find out which of the NLP services has the best accuracy. For this purpose, ground truth files were created and the results from the NLP services were compared to these files. Part of the results is shown in Figure 7.

Fig. 7. The average performance of NLP services, independently of the kind of text used.

From Figure 7 it becomes clear that AlchemyAPI is twice as good at recognizing entities (25.61% vs. 12.41%). Furthermore, we can deduce that using the ground truth text renders the best result (22.2%). It is, however, comforting that the difference is only about 6.4%.

Based on these results, we conclude that for general entity extraction AlchemyAPI is a better fit, although a combination of the two is also possible, as OpenCalais was able to recognize some elements (besides relationships) that AlchemyAPI did not, and vice versa.
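The entity scores above can be reproduced in spirit by comparing the set of entities an NLP service returns with a hand-made ground-truth set; the sketch below computes such a simple recall-style percentage and is only an illustration of the idea, not the dissertation's evaluation code.

```python
def entity_accuracy(recognised, ground_truth):
    """Percentage of ground-truth entities found by the NLP service
    (case-insensitive exact matching)."""
    recognised = {e.lower() for e in recognised}
    found = [e for e in ground_truth if e.lower() in recognised]
    return 100.0 * len(found) / len(ground_truth)

truth = ["Gandalf", "Saruman", "Isengard", "Middle-earth"]
service_output = ["Gandalf", "Isengard"]
print(entity_accuracy(service_output, truth))  # 50.0
```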

C. Client software

We now discuss some of the key elements that influence the user experience and how well our prototype application behaves. The experiments were designed to test whether the prototype in its current form is able to survive in a basic real-life environment. Furthermore, for all tests OCRopy was used as OCR software (unless otherwise mentioned) and AlchemyAPI was used as NLP service.

• Time measurements: we measured the duration of the whole process of taking an image, sending it to the server, processing the response and notifying the user that content was available. We did this in two configurations: one in which a stable WiFi connection was used (to simulate reading at home or at work) and one in which an EDGE connection was used (to simulate reading on the train or in remote areas). Although this is a pure networking test, it is important for the user experience in general.
• Memory measurements: in the given context, it is also interesting to take a look at the memory consumption. We are fetching and showing a lot of images, but we are using devices which are limited in memory. The wearable device we used for our tests, for example, had only 1 GB of RAM available.
• Battery consumption: lastly, we investigate the impact of the prototype on the battery life of the device.
We start with the network time measurements; the results are listed in Figure 8. We used both OCR technologies here, because the difference between them is enormous.

Fig. 8. A comparison between the WiFi and EDGE network types.

It is clear that although OCRopy performs better in accuracy, its impact on the user experience is far worse, regardless of the network technology used. While the whole process of taking an image, sending it and processing the response takes less than one minute with Tesseract, the same process takes roughly 5 minutes using OCRopy. Taking a closer look at the results based on the network technology used reveals an unmistakably large flaw of mobile devices: using EDGE, the average processing time (regardless of the OCR technology) is about 7 minutes.
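For reference, the round-trip time of this request cycle can be measured from any HTTP client along the lines below; the /process endpoint name follows the earlier sketch and is an assumption, and the actual Android client of course uses the platform's own HTTP and timing facilities.

```python
import time
import requests

def measure_round_trip(server_url, image_path):
    """Time the full cycle: upload a page image and wait for the media response."""
    with open(image_path, "rb") as f:
        start = time.perf_counter()
        response = requests.post(server_url + "/process", files={"image": f})
        elapsed = time.perf_counter() - start
    return elapsed, response.status_code

seconds, status = measure_round_trip("http://localhost:5000", "page.png")
print("round trip took %.1f s (HTTP %d)" % (seconds, status))
```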

Next we look at the memory consumption. As mentioned earlier, the devices we are investigating have limited memory available. The Epson Moverio only has 1 GB of RAM; for this reason we also performed the tests on a smartphone which had 3 GB available. We tested what happened to the memory consumption when the GIF gallery was opened once the results were processed. The memory consumption can be seen in Figure 9.

Fig. 9. Memory consumption of the prototype on the Epson Moverio while opening the GIF gallery.

The memory usage may not seem important, but the lack of memory on the wearable device seriously impacts the user experience. Not only does the application run less smoothly and process more slowly, the prototype even crashes on the glasses because the Android OS terminates the application due to memory shortage.

Lastly, the battery consumption was tested to see how much energy is used if, for example, the user reads a book for 30 minutes while using the application. In Figure 10 we see a roughly linear battery consumption.

Fig. 10. Battery consumption over a 30 minute time period.

This means that the application can only be used for about two hours before the device shuts down. Because of their limited battery capacity, this is a major flaw of wearable devices as well.

V. FUTURE WORK

We have built a working prototype that is able to detect context and deliver extra media to enhance the user experience, though there is still some work left to be done. The first step one would take in a real-life situation is to train the OCR software for better accuracy; this is also the next step we suggest. By training the OCR software we could achieve better detection of entities and enhance the experience even further. Based on our tests, we believe it might also be beneficial to combine the data from the NLP services. This could be used to detect false positives, i.e. characters and locations that are recognized but are not mentioned in the text. By combining the services, one could also detect relationships and dates, which can be important for dynamic time lines, as discussed below. Another possible enhancement would be to make the web service more generic. At the moment, some of the media channels are hard-coded. Our system can easily be modified to add generic media channels; for example, one could devise a REST call that allows the user to add his own media channel by providing the URL, API keys and an example of the API response with its keywords, as sketched below. Implementing these improvements could enable the implementation of a dynamic time line.
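A possible shape for such a registration call, purely as a sketch of the proposed improvement (the endpoint name and fields are illustrative and not part of the current prototype):

```python
from flask import Flask, request, jsonify

app = Flask(__name__)
channels = {}  # registered generic media channels, kept in memory here

@app.route("/channels", methods=["POST"])
def register_channel():
    """Register a user-supplied media channel: base URL, API key and
    the keywords under which results appear in the API response."""
    spec = request.get_json()
    channels[spec["name"]] = {
        "url": spec["url"],
        "api_key": spec["api_key"],
        "result_keywords": spec["result_keywords"],
    }
    return jsonify({"registered": spec["name"]}), 201
```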

A. Dynamic time line

A problem that we have not yet mentioned is the time line problem. This problem consists of knowing when events took place and being able to place them in chronological order. At this point we might show the user information (such as lineage information) about entities he has no knowledge of yet. What we suggest is to build a time line that is generated dynamically as the user progresses through the book. This can be achieved by building a list of characters that have been encountered and the relationships that have been mentioned up to that point, and by analysing characters and dates.
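A sketch of the data structure behind such a dynamic time line (again illustrative, not implemented in the prototype): it only exposes characters and events that have appeared up to the page the reader has reached.

```python
class DynamicTimeline:
    """Tracks which characters and events the reader has already met."""

    def __init__(self):
        self.events = []  # (page, character, description) tuples in reading order

    def record(self, page, character, description):
        self.events.append((page, character, description))

    def visible_up_to(self, current_page):
        """Return only the events the reader can know about so far,
        in page order, avoiding spoilers."""
        return sorted(e for e in self.events if e[0] <= current_page)

timeline = DynamicTimeline()
timeline.record(12, "Gandalf", "arrives in the Shire")
timeline.record(45, "Gandalf", "visits Saruman in Isengard")
print(timeline.visible_up_to(20))  # only the page-12 event
```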

VI. CONCLUSIONS

In this dissertation we investigated how suitable wearable devices are for use in context-sensitive areas and how we can use that context to enhance the user experience. We applied this to a use case in which books were analysed in order to present the user with media that could augment his experience. To this end, we needed to convert the image to text, extract entities from this text and look up media for these entities. We tested OCR software and several NLP services and concluded that:

• OCRopy outperforms Tesseract out of the box in terms of accuracy.
• Tesseract is much faster than OCRopy.
• AlchemyAPI is better at extracting entities than OpenCalais.
From these tests we were also able to find the first flaw of our test device: the camera resolution is too low to correctly perform OCR. We built a REST web service that automates the OCR and NLP process and looks up media for the extracted entities. A client prototype was also developed for Android to show these media to the user. Tests were also run to measure the processing time using different kinds of network technologies, and we concluded that wearable devices (and mobile devices in general) are not suitable when a network connection is needed. To solve the issue of screen real estate, we used sound notifications as well as built-in techniques such as zooming and panning. We also built a feed and used built-in layout types that allowed us to pack as much information as possible into one location.

This revealed a third shortcoming of wearable devices: due to their weak hardware specifications they lack certain properties that enable a smooth system; in our case this was the memory. All these flaws can be remedied by making trade-offs: using an external camera or smartphone for scanning, using WiFi only, and showing less media in order to use less memory. We can thus conclude that wearable devices are certainly suitable for user experience enhancement, although trade-offs have to be made.

REFERENCES

[1] Ray Smith, “An overview of the Tesseract OCR engine,” in ICDAR, 2007, vol. 7, pp. 629–633.
[2] Thomas M. Breuel, “The OCRopus open source OCR system,” in Electronic Imaging 2008, International Society for Optics and Photonics, 2008, pp. 68150F–68150F.
[3] Elizabeth D. Liddy, “Natural language processing,” 2001.
[4] P. Spyns, “Natural language processing,” Methods of Information in Medicine, vol. 35, no. 4, pp. 285–301, 1996.
[5] Christian Bizer, Tom Heath, and Tim Berners-Lee, “Linked data: the story so far,” International Journal on Semantic Web and Information Systems, vol. 5, no. 3, pp. 1–22, 2009.
[6] Tim Berners-Lee, James Hendler, Ora Lassila, et al., “The semantic web,” 2001.
[7] Joseph Turian, “Using AlchemyAPI for enterprise-grade text analysis,” Tech. Rep., AlchemyAPI, 08 2013.
[8] Marius-Gabriel Butuc, “Semantically enriching content using OpenCalais,” EDITIA, vol. 9, pp. 77–88, 2009.
[9] Minhee Chae and Jinwoo Kim, “Do size and structure matter to mobile users? An empirical study of the effects of screen size, information structure, and task complexity on user activities with standard web phones,” Behaviour & Information Technology, vol. 23, no. 3, pp. 165–181, 2004.
[10] Jacob Eisenstein, Jean Vanderdonckt, and Angel Puerta, “Adapting to mobile contexts with user-interface modeling,” in Mobile Computing Systems and Applications, Third IEEE Workshop on, IEEE, 2000, pp. 83–92.
[11] Daniel Churchill and John Hedberg, “Learning object design considerations for small-screen handheld devices,” Computers & Education, vol. 50, no. 3, pp. 881–893, 2008.
[12] Stephen A. Brewster et al., “Sound in the interface to a mobile computer,” in HCI (2), 1999, pp. 43–47.
[13] George W. Furnas, Generalized Fisheye Views, vol. 17, ACM, 1986.
[14] Jian Liang, David Doermann, and Huiping Li, “Camera-based analysis of text and documents: a survey,” International Journal of Document Analysis and Recognition (IJDAR), vol. 7, no. 2-3, pp. 84–104, 2005.
[15] Luca Chittaro, “Visualizing information on mobile devices,” Computer, vol. 39, no. 3, pp. 40–45, 2006.

Verbeteren van de gebruikservaring gebruikmakendvan context data en draagbare toestellen

Arya Ghodsi

Begeleider(s): Prof. dr. Peter Lambert, Prof. dr. ir. Rik Van de Walle, Ir. Jonas El Sayeh Khalil, IgnaceSaenen

Abstract—Mobiele en draagbare apparaten worden elke dag belangrij-ker. Deze masterproef onderzoekt of dit soort apparaten geschikt zijn voorhet verbeteren van de gebruikservaring gebruikmakend van context data.We zullen gebruik maken van een van deze apparaten in een use case waarwe extra media aanreiken aan de gebruiker op basis van een boek dat hijaan het lezen is.Er zijn echter enkele elementen die voor moeilijkheden kunnen zorgen.

Trefwoorden—draagbare apparaten, natuurlijke taal verwerkingnaturallanguage processing, ocr

I. INTRODUCTIE

DRAAGBARE computers worden steeds prominenter in hetdagelijks leven. Men moet daarom overwegen hoe deze

apparaten gebruikt kunnen worden om het dagelijkse leven teverbeteren en hoe ze meer dan een gadget kunnen worden. Dezetoestellen hebben veel potentieel door de manier waarop mensenze gebruiken. Ze hoeven niet meer naar hun hun portemonneeof tas te reiken. Met een snelle beweging kunnen ze veel infor-matie ontvangen in een keer.Binnen de categorie van draagbare computers kunnen we eenonderscheid maken tussen verschillende soorten, zoals slimmehorloges, draagbare apparaten die op het hoofd gedragen moe-ten worden of draagbare apparaten die zijn ontworpen voor hetopvolgen van fysieke activiteit.In dit proefschrift zullen we ons richten op apparaten zoalsGoogle Glass 1, ReconJET glasses 2, Epson Moverio 3 etc.

In deze masterproef onderzoeken we de Epson Moverio omte zien of hij geschikt is voor het tonen van aanvullende mediaaan de gebruiker om zijn ervaring te verbeteren. We hebben eenuse case waarbij de gebruiker extra media omtrent het boek dathij aan het lezen te zien krijgt.

Om dit te bewerkstelligen hebben we diverse mediakanalengeselecteerd.Om te weten welke media moeten worden getoond, moet meneerst weten welk boek de gebruiker leest. Als we veronderstel-len dat we geen voorkennis hebben over het boek, is de enigemogelijke oplossing te proberen om deze kennis af te leiden uitfoto’s die genomen worden als de gebruiker het boek aan hetlezen is. We moeten dus de foto’s die genomen werden met ditapparaat, omzetten in tekst die machines kunnen lezen. Dat kankunnen we bereiken met behulp van OCR. Zodra dat gebeurd is,moeten we de tekst die we hebben verkregen via de OCR kun-nen begrijpen.

1 urlhttp://www.google.es/glass/start/2http://www.reconinstruments.com/products/jet/3http://www.epson.com/cgi-bin/Store/jsp/Landing/

moverio_developer-program.do

Er zijn meerdere manieren om tekst te analyseren met behulpvan natuurlijke taalverwerking.Om te testen welke instellingen optimaal presteren, hebben weverschillende experimenten uitgevoerd.

We hebben een prototype van een applicatie ontworpen diemedia kan tonen. Het zware werk wordt uitgevoerd door eenwebserver. Al die media zijn gebaseerd op de context; dat wilzeggen de pagina die de gebruiker leest.In de volgende paragraaf zullen we een overzicht geven van ge-relateerd werk. In deel III, leggen we de opzet uit en benadruk-ken we enkele implementatiedetails. In hoofdstuk IV leggen wekort onze experimenten uit en worden hun resultaten beschre-ven. Tot slot wijzen we op een aantal kwesties voor toekomstigeverbeteringen in hoofdstuk V en we formuleren een conclusie inhoofdstuk VI.

II. LITERATUUR

In dit gedeelte beschrijven we een deel van de reeds be-staande literatuur over de drie belangrijkste gebieden van ditproefschrift. We beginnen met de beschrijving van de huidigeontwikkeling rond OCR-software. Vervolgens wordt de lezeringeleid in de wereld van semantisch redeneren. We eindigenmet een literatuurbeschrijving over de maximale benutting vande kleine schermresolutie die gebruikelijk is in de categorie vanapparaten die we onderzoeken in dit proefschrift.

A. OCR

Elke grote OCR-technologie die op de markt is, doet ongeveerhetzelfde. Gegeven een figuur als input, zal de OCR-softwareeerst proberen om wat preprocessing te doen. Daarna wordt delay-out geanalyseerd. Zodra de lay-out is geanalyseerd, zal desoftware proberen om tekstregels en woorden (in de afbeelding)te herkennen. Vervolgens worden verschillende benaderingengebruikt die proberen om woorden te herkennen (taalkundig).In de laatste stap zal tekst worden terugegegeven die door com-puters kan worden verwerkt. Een overzicht van deze stappen iste vinden in Fig. 1.

Fig. 1. De gebruikelijke stappen bij het verwerken van afbeeldingen door OCR-software.

We zullen twee grote OCR technologien bespreken: Tesseracten OCRopy (vroeger gekend als OCRopus).

A.1 Tesseract

Tesseract is een open-source OCR-engine dat ontwikkeldwerd door HP tussen 1984 en 1994. Tesseract had een aanzien-lijke voorsprong op vlak van nauwkeurigheid op de commercileconcurrenten, maar in 1994 werd de ontwikkeling volledig stopgezet. In 2005 HP maakte Tesseract echter open-source en heeftRay Smith (Google Inc.) de ontwikkeling verder gezet4 [1].

Tesseract gebruikt een traditionele stap voor stap pijpleidingvoor de verwerking van afbeeldingen. Sommige stappen wa-ren bijzonder nauwkeurig in de tijd waarin Tesseract is ont-staan. Vanwege zijn banden met HP, heeft Tesseract echter nooitzijn eigen pagina-indeling-analyse-algoritme gehad. Boven-dien steunde het op een gepatenteerde pagina-indeling-analyse-algoritme van HP. De eerste stap van de verwerking bestond uithet opslaan van de omlijning die verkregen werd uit een analysevan aangesloten componenten. Hoewel deze stap computatio-neel duur was, had het het voordeel dat bij het inspecteren vaninnestelingen van de omlijningen het gemakkelijker was om in-verse tekst te detecteren. Aan het einde van deze fase werdencontouren verzameld, door nestelen, en ontstonden er Blobs. Inde tweede fase werden Blobs in tekstregels georganiseerd. Delijnen en regio’s werden geanalyseerd en de lijnen werden danonderverdeeld in woorden.Na deze stap volgen er nog twee stappen voor de herkenning. Bijde eerste stap probeert de applicatie elk woord te herkennen. Elkwoord dat herkend wordt, wordt vervolgens naar een adaptieveclassifier gestuurd als trainingsdata. Dit adaptieve classifier pro-beert dan woorden te herkennen met een hogere nauwkeurig-heid. Het kan voorkomen dat de classifier nuttige trainingsdatanet iets te laat heeft verwerkt, daarom zal er een tweede stap opde pagina worden uitgevoerd. Tenslotte zal de applicatie vagelocaties herkennen en controleert het alternatieve hypothesenvoor de x-hoogte om tekst die niet in hoofdletters staat te lo-kaliseren. Deze architectuur kan worden teruggevonden in Fig.2[1](in het Engels).

4Het project is nu beschikbaar op https://code.google.com/p/tesseract-ocr/

Fig. 2. Architectuur van het open-source OCR programma Tesseract.

A.2 OCRopy (vroeger bekend als OCRopus)

OCRopus legt de nadruk op modulariteit, eenvoudige uit-breidbaarheid en hergebruik. In [2] bespreekt de auteur de be-langrijkste eigenschappen van de OCR-software.De architectuur van OCRopus is in feite bijna volledig hetzelfdeals in Fig. 1.De architectuur is gebouwd op een zodanige wijze dat er geenbacktracking is. De software gaat strikt voorwaarts en bestaatuit drie belangrijke stappen:

Layout analyse In deze stap probeert de software kolommentekst, blokken, tekstlijnen en de leesvolgorde te identificerenTekstlijn herkenning Deze stap is verantwoordelijk voor hetherkennen van tekst in een bepaalde lijn. OCRopus is ontwor-pen op een zodanige wijze dat het in staat is verticale en vanrechts-naar-links geschreven tekst te herkennen. Verder wordter een hypothese graaf geconstrueerd voor alternatieve herken-ning.Statistische taal modellering De laatste stap bestaat uit de inte-gratie van de alternatieve herkenningen uit de voorgaande stapmet behulp voorkennis over de taal, domein, woordenschat etc.

Beide OCR programma’s worden gebruikt in experimentenom te bepalen welk het meest geschikte is in ons geval.

B. Semantisch Redeneren

Dit onderwerp kan ook worden gezien als een onderdeel vanhet Natuurlijke Taal Verwerking(NTV) gebied waar reeds zeerveel onderzoek naar gedaan is.Bij NTV probeert men tekst te analyseren en op een menseljkemanier taal te verwerken door gebruik van machines, d.w.z.computers.Dit gebeurt met behulp van een reeks van (taalkundige) theo-rieen met technologien om textit mensachtige taalverwerking tebereiken. Met de laatste term, bedoelen we dat een machine instaat zou moeten zijn om het volgende te doen [3], [4]:1. Vat een gegeven tekst samen.2. Vertaal tekst van een taal naar een andere taal.3. Vragen kunnen beantwoorden omtrent de context van een ge-geven tekst.4. Conclusies kunnen trekken voor een gegeven tekst.Hoewel niet alle hierboven vernoemde punten interessant zijnvoor ons, zijn nummer 3 en 4 punten van belang voor ons onder-zoek. Men kan proberen informatie manueel af te leiden uit eenbepaalde tekst, hiervoor kan Linked Data (en Semantic Web)een grote hulp zijn.Linked Data is simpel gezegd een laag bovenop het normaleWorld Wide Web zoals wij het kennen. Het maakt het moge-lijk om getypeerde verbanden te leggen tussen gegevens uit ver-schillende bronnen. In tegenstelling tot Web 2.0 mash-ups dievaste data bronnen gebruiken, is Linked Data actief op een we-reldwijde, niet-gebonden data ruimte. [5], [6].

Het semantisch web is gebaseerd op principes die Berners-Lee gaf als richtlijnen in [6].

De Linking Open Data Project 5 is een een van de eerste ge-meenschapen die probeert het Web van Data te identificeren aande hand van reeds bestaande datasets die verkrijgbaar zijn on-der open licenties. Het converteert deze data sets naar RDF opbasis van de hierboven genoemde principes.[5]. Figuur 3 toonteen indrukwekkende collectie van datasets. Volgens de statistie-ken verzameld door dit project, bestaat het web van Data it 19,5miljard RDF triples.

Fig. 3. Hier kunnen we al de verbindingen zien tussen de verschillende soortendata sets. Er zijn 570 data sets die onderlingen verbonden zijn door 2009links. Een boog stelt een verbinding tussen twee data sets voor. Hoe don-kerder een boog wordt, hoe meer verbindingen er zijn met andere data sets.

Zoals te zien is op Fig. 3 is DBpedia duidelijk de grootstedata set en bestaat het uit meer dan 3 miljard RDFs (dat is 17 %van de totale Web of Data).

Men kan ook gebruik maken van reeds bestaande NTV-diensten die gebruik maken van deze bronnen (zoals DBpedia).We zullen deze diensten kort bespreken.

B.1 AlchemeyAPI

AlchemyAPI biedt een breed scala van tekstanalysefunctiesdie we kort zullen bespreken. Zodra een document in de vormvan pure tekst of websitelink, aan AlchemyAPI gegeven wordt,zal deze proberen om het volgende te doen[7]:Taal identificatie AlchemyAPI kan over 100 talen herkennen.Trefwoord extractie Alchemy probeert woorden uit de tekst tehalen die bijdragen aan het onderwerp van de tekst.Concept extractie Concept extractie lijkt zeer sterk op tref-woord extractie, echter concepten hoeven niet expliciet te wor-den vermeld, bv. als een bepaalde tekst spreekt over twee tove-naars Harry en Voldemort, dan zal trefwoord extractie ze identi-ficeren, maar het concept extractie zal het concept magie identi-ficeren.Entiteit extractie In deze module zal AlchemyAPI proberen omexpliciet op zoek te gaan naar entiteiten die zijn gedefinieerdzijn op Linked Data sets zoals DBpedia, YAGO of Freebase.Dit is een zeer belangrijke eigenschap die bijzonder interessantkan zijn voor deze dissertatie.

5http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

B.2 OpenCalais

Een ander hulpmiddel dat dezelfde verscheidenheid aan func-ties biedt, is OpenCalais. OpenCalais heeft echter enkele tekort-komingen. Op dit moment is het slechts in staat om drie talente verwerken: Engels, Frans en Spaans. Voor de ondersteundetalen zal OpenCalais proberen om entiteiten, gebeurtenissen enfeiten te identificeren. [8]. Neem bv. het volgende (fictionele)uittreksel als voorbeeld:“Gandalf was appointed as head of the fellowship of the ring.He would first visit Saruman in Isengard, Middle-earth to seekadvice.”Voor dit fragment zou OpenCalais entiteiten moeten kunnen af-leiden (in dit voorbeeld was OpenCalais in staat om Gandalfte identificeren), maar ook gebeurtenissen (OpenCalais was instaat om correct afleiden dat Gandalf werd benoemd tot hoofdvan de gezelschap) en feiten.

C. Beheer van kleine scherm resoluties

Er gebeurt veel (lopend) onderzoek naar hoe men het kleinescherm van een mobiel apparaat best kan benutten. Het pro-bleem dat de mobiele en draagbare apparaten achtervolgt is dehoeveelheid informatie die moet worden weergegeven gezien dekleine hoeveelheid aanwezige ruimte. De manier waarop eentoepassing met dit probleem omgaat, heeft ook invloed op degebruikers perceptie en de wijze waarop hij in staat is om com-plexe taken op kleine beeldschermen te behandelen [9].

Een reeks oplossingen voor deze problemen zijn principesomtrent het ontwerp. Deze principes zijn een reeks richtlijnenover hoe de front-ends te ontwerpen om zoveel mogelijk infor-matie te tonen zonder het algemeen overzicht te verliezen [10][11]. Een andere reeks oplossingen is gebaseerd op sonisch ver-beterde systemen, waarbij geluiden worden gebruikt om infor-matie te leveren aan de gebruiker en ruimte te besparen bv. ge-luidsmeldingen in plaats van visuele iconen [12]. Interface ma-nipulaties zijn een andere manier om dit probleem op te lossen,in dit geval de wordt gebruiker overspoeld met informatie en ishij in staat om te navigeren door deze informatie met behulp vantechnieken zoals fish-eye, zoomen en pannen enz. [13]

III. OPZET EN IMPLEMENTATIE

We hebben een webservice ontwikkeld in Python met behulpvan Flask en Apache. Deze webservice neemt de taak van hetvertalen van de beelden van de gebruikersapplicatie naar tekstvia OCR op zich. Zodra dit gebeurd is, wordt het resultaat naareen van de NTV-diensten verzonden.De entiteiten die worden herkend, worden vervolgens gebruiktvoor het opzoeken van de media. We hebben verschillende me-diakanalen gekozen:Stamboom informatieIn de boeken van onze use-case (In de Ban van de Ring(IBR))zijn er minstens 982 karakters.Dit kan gevolgen hebben wanneer de gebruiker niet meer weetover welk personage het gaat en welke rol dit personage speeltin het verhaal.Om dit te verhelpen kunnen we stamboom informatie tonen vanentiteiten die we in staat zijn te herkennen.Kaartinformatie

Een andere zeer interactief media-kanaal, is het Hobbit project6. Dit project gebruikt WebGL en HTML5 om een interactievekaart te creren waarop de gebruiker de locaties van de IBR we-reld diepgaander kan bestuderen en trajecten van bepaalde per-sonages bekijken.AfbeeldingenHet derde kanaal dat we gebruiken zijn GIF-afbeeldingen. Eris een overvloed van GIF-diensten. We hebben ervoor gekozenom Giphy 7 te gebruiken wegens de eenvoud van de API.Het stelt ons in staat om de service te bevragen over entiteitendie we hebben herkend.Naast eenvoud heeft Giphy ook de toegevoegde waarde dat heteen heleboel nuttige informatie over de GIFs die andere dienstenniet aanbieden.TweetsTot slot hebben we gekozen om Twitter te doorzoeken voortweets over de personage- of locatie-entiteiten die we hebbenherkend.Twitter toont niet alleen de ervaring van andere gebruikers, maarTwitter kan ook media toevoegen als bijlage.

Twee van deze kanalen zijn hard-coded en alleen bruikbaarin deze context, terwijl de GIF-afbeeldingen en tweets niet hardgecodeerd zijn. Het is een uitdaging om geschikte media te vin-den zonder context-specifieke bronnen te raadplegen.

De bovenstaande media worden vervolgens teruggestuurdnaar de gebruikersapplicatie. Die applicatie is een Android-applicatie. Het bestaat uit verschillende activiteiten. Een daar-van is een GIF galerij, waarbij alle afbeeldingen van herkendeentiteiten worden getoond. Een andere is een overzicht van deentiteiten, waarvoor een specifieke entiteit alle beschikbare in-formatie (foto’s, tweets, geslacht en de kaart informatie) wordtweergegeven.De mogelijke manieren waarop de toepassing kan gebruikt wor-den, is gevisualiseerd in Fig. 4.De entiteit overzicht, GIF galerij en de manier van notifcatiezijn tegelijkertijd onze oplossingen voor de kleine schermreso-lutie uitdaging.

6http://middle-earth.thehobbit.com/7http://giphy.com en https://github.com/Giphy/

GiphyAPI

Fig. 4. De algemene gebruiksgang bij het gebruik van de Android applicatie.

Met behulp van een overzicht, kunnen we zoveel mogelijkheterogene informatie verpakken in n locatie. Dezelfde ana-logie gaat op voor de GIF galerij. Bovendien kan de gebrui-ker, doordat de toepassing toelaat om zowel in portrait-mode alslandscape-mode te werken, kiezen om meer informatie op hetscherm te tonen indien hij dat wenst.We hebben ook de gebruiker de mogelijkheid gegeven om ver-schillende technieken te gebruiken wanneer de informatie dichtverpakt is. Zo heeft hij heeft de mogelijkheid om in te zoomenen te pannen bij het bekijken van de stamboom informatie.

IV. EXPERIMENTEN EN RESULTATEN

Zoals we in de vorige paragrafen zeiden, kan de gebruikerkiezen uit twee OCR-technologieen en twee NTV-diensten.We experimenteerden hiermee om een van elk in te stellen alsstandaard keuze.

A. OCR

De eerste tekortkoming van ons draagbare toestel kwam aanhet licht wanneer we probeerden om de OCR-software te testen.De resolutie van de camera van het draagbare apparaat (0.3MP)was te zwak om OCR uit te voeren. Om deze reden, voerdenwe de testen uit met een smartphone. We hebben geprobeerdom het draagbare apparaat te simuleren en namen foto’s in 5MPresolutie. De resultaten zijn (gedeeltelijk) te zien in Fig. IV-A.

Fig. 5. De gemiddelde prestatie vanOCR software afhankelijk van degebruikte toestel.

Fig. 6. De gemiddelde prestatie vanOCROpus vs. Tesseract.

Uit de grafiek in Fig. 5 is het duidelijk dat met een gemid-delde nauwkeurigheid van 1,4% de camera van de Epson Mo-verio BT-200 te zwak is om te worden gebruikt in een omge-ving waar het nodig zou zijn om de beelden te verwerken metOCR. De camerabeelden die de draagbare apparaten simuleer-den zijn duidelijk meer geschikt met een gemiddelde nauwkeu-righeid van 50,2%.Als we het toestel waarmee we foto’s namen niet in rekeningbrengen, kunnen we uit Fig. 6 afleiden dat OCRopy Tesseractovertreft wanneer we de software niet trainen. OCRopy slaagterin om een gemiddelde nauwkeurigheid van 40,4% te realise-ren ten opzichte van de matige nauwkeurigheid van 11,2% vanTesseract. Wij geloven dat dit te wijten is aan de interne wer-king van de software, in het bijzonder heeft OCRopy een beterealgoritme om de helling in afbeeldingen te corrigeren.

B. NTV

We voerden ook testen uit om te ontdekken welke van deNTV-diensten de beste nauwkeurigheid heeft. Daarvoor wer-den bestanden gemaakt met grondwaarheden en werden de re-sultaten van de NTV-diensten vergeleken met deze waarheid-bestanden. We voerden deze testen eenmaal met de teksten ver-kregen van de OCR en eenmaal met de grondwaarheid tekstenuit.Delen van de resultaten zijn weergegeven in Fig. IV-B

Fig. 7. Gemiddelde prestatie van NTV-diensten onafhankelijk van de soorttekst dat werd gebruikt.

Uit Fig. 7 is het duidelijk dat Alchemy API twee keer zo goedis als OpenCalais in het herkennen van entiteiten (25.61% vs.12.41%). Verder kunnen we afleiden dat de grondwaarheidtek-sten het beste resultaat (22,2%) opleveren. Het verschil bedraagt

echter slechts 6,4% .Op basis van deze resultaten concluderen we dat voor alge-

mene entiteit extractie Alchemy API beter gepast is. Een com-binatie van beide is eveneens mogelijk, omdat OpenCalais som-mige elementen kon herkennen (naast verbanden) die Alchemyniet kon herkennen en vice versa.

C. Gebruikers software

We zullen in dit hoofdstuk een aantal belangrijke aspecten be-spreken die de gebruikerservaring beınvloeden. We zullen ooknagaan hoe goed ons prototype applicatie draait.De experimenten werden ontworpen om te testen of het proto-type in zijn huidige vorm kan gebruikt worden in een eenvou-dige real-life omgeving.Voorts werden voor alle proeven OCRopy en AlchemyAPI ge-bruikt (tenzij anders vermeld).We hebben de volgende testen uitgevoerd:

1. Tijdsmetingen: we meten hoe lang het hele proces om eenfoto te nemen, het op te sturen naar de server, het verwerken vande antwoord van de server en de gebruiker te waarschuwen datnieuwe inhoud beschikbaar was, duurt.We deden dit in twee soorten situaties: een waarin een stabieleWiFi-verbinding werd gebruikt (het simuleren van lezen thuis ofop het werk) en een waarin een edge-verbinding werd gebruikt(het simuleren van lezen in de trein of in afgelegen gebied).Hoewel dit een pure netwerktest is, is het een belangrijke factordie de gebruikerservaring kan beınvloeden.2. Geheugenverbuiksmetingen: in de gegeven context, is hetook interessant om een blik te werpen op het geheugengebruik .We halen een heleboel beelden op en moeten deze ook tonen enwe gebruiken apparaten die beperkt zijn in het geheugen.Het draagbare apparaat dat we gebruiken voor onze tests hadbijvoorbeeld slechts 1 GB RAM-geheugen beschikbaar.3. Batterijverbruik: ten slotte zullen we de impact van het pro-totype op het batterijverbruik van het apparaat onderzoeken.We beginnen met de tijdsmetingen, de resultaten kunnen gezienworden in Fig. 8.We gebruikten beide OCR technologieen hier, omdat de ver-schillen tussen de twee groot zijn.

Fig. 8. Een vergelijk tussen WiFi en edge netwerk type.

Het is duidelijk dat hoewel OCRopy beter presteert in nauw-keurigheid (zie vorige sectie), de impact op de user experienceveel erger is, ongeacht de gebruikte netwerktechnologie.

Terwijl het hele proces van het nemen van een foto, verzendenen verwerken van de respons minder dan een minuut duurt metTesseract, duurt hetzelfde proces ongeveer 5 minuten met be-hulp van OCRopy.Los van de OCR-software kennen de mobiele apparaten duide-lijk zienbare. Indien er gebruik wordt gemaakt van Edge alsnetwerktechnologie neemt de gemiddelde looptijd (onafhanke-lijk van de OCR-technologie) toe tot 7 minuten.

Nu bekijken we het geheugenverbruik. Zoals we al eerdervermeldden, hebben de apparaten die we onderzoeken niet veelbeschikbaar geheugen. De Epson Moverio heeft bijvoorbeeldslechts 1 GB RAM-geheugen beschikbaar. Om het contrast telaten zien, voerden we ook de tests uit op de smartphone die 3GB geheugen beschikbaar had.We testten wat gebeurde met het geheugengebruik wanneer mende GIF galerij zou openen zodra de resultaten waren verwerkt.Het geheugengebruik is te zien op Fig. 9.

Fig. 9. Geheugen verbruik van het prototype op de Epson Moverio, wanneer deGIF galerij werd geopend.

Het geheugenverbruik lijkt misschien niet belangrijk, maarhet gebrek aan geheugen op de draagbare apparaten heeft eenserieuze invloed op de gebruikerservaring. Niet alleen zal detoepassing minder soepel draaien en langzamer gegevens ver-werken, maar het prototype crasht zelfs op het draagbaar ap-paraat omdat het Android OS de toepassing als gevolg van hetgeheugen-tekort zal beeindigen.

Tenslotte werd de batterijconsumptie getest om te zien hoe-veel energie wordt gebruikt. Als de gebruiker bijvoorbeeld eenboek leest gedurende 30 minuten met gebruik van de applicatie.In Fig. 10 zien we een lineair batterij verbruik.

Fig. 10. Batterij verbruik over een tijdsspanne van 30 minuten.

Dat betekent dat de applicatie slechts gebruikt kan wordenvoor een periode van 2 uur alvorens het apparaat zich zal afslui-ten. Ook dat is een andere grote tekortkoming van de draagbareapparaten.

V. MOGELIJKE AANPASSINGEN VOOR DE TOEKOMST

We hebben een werkend prototype dat in staat is om context tedetecteren en extra media kan aanreiken om de gebruikservaringte verbeteren, maar er zijn nog heel wat verbeteringen mogelijk.De eerste stap in een echte omgeving zou het trainen van deOCR zijn. Dit zal zorgen voor een betere nauwkeurigheid vande gefotografeerde tekst. Door het trainen van de OCR-softwarekunnen we een betere detectie van entiteiten bereiken en de ge-bruikservaring nog verder verbeteren.Op basis van onze tests zijn wij van mening dat het ook nuttigkan zijn om de data van de NTV-diensten te combineren. Ditkan worden gebruikt om valse positieven, d.w.z. karakters enlocaties die worden herkend maar niet in de tekst genoemd wor-den, te detecteren. Door de diensten te combineren, kan menook relaties en datums die belangrijk zijn voor dynamische tijd-lijnen (zie verder) detecteren.Een andere mogelijke verbetering zou zijn om de webserviceiets generieker op te bouwen. Op dit moment, zijn een aantalvan de media kanalen hard gecodeerd.Ons systeem kan eenvoudig worden aangepast om mediakana-len die generiek zijn toe te voegen. Zo kan men een REST-callbouwen waarmee de gebruiker zijn eigen mediakanaal kan toe-voegen door het verstrekken van de URL, API-sleutels en eenvoorbeeld van de API respons en de trefwoorden op te geven.De uitvoering van deze verbeteringen laten toe om een dyna-mische tijdlijn te implementeren. We zullen dit concept in hetvolgende gedeelte uitleggen.

A. Dynamic time line

Een probleem dat we niet hebben vernoemd, is het tijdlijnpro-bleem. Dat probleem bestaat uit het kunnen afleiden wanneergebeurtenissen plaatsvonden en de mogelijkheid om ze te plaat-sen in een chronologische volgorde.Op dit moment kunnen we de gebruikersinformatie (bijvoor-beeld stamboominformatie) over entiteiten tonen, waar de lezernog geen kennis mee heeft gemaakt.Wat wij voorstellen is om een tijdslijn op te stellen die dyna-misch wordt bijgewerkt wanneer de gebruiker vordert in hetboek. Dat kan worden verwezenlijkt door een lijst van perso-nages en relaties bij te houden, alsook door de analyse van per-sonages en data.

VI. CONCLUSIONS

In deze masterproef onderzochten we hoe geschikt draagbareapparaten zijn wanner ze gebruikt werden in context-gevoeligesituaties en hoe we ze kunnen gebruiken om de gebruikerserva-ring te verbeteren.We werkten een use-case uit die de gebruikerservaring bij hetlezen van boeken verbeterd. Hiervoor kreeg de gebruiker mediavoorgeschoteld die te maken hebben met zijn boek.Hiertoe was het nodig om foto’s om te zetten naar tekst, dezetekst te analyseren en hieruit entiteiten te filteren. In de laatstestap werd media gezocht die bij de geextraheerde entiteiten ho-ren. We testten verschillende NTV-diensten en OCR-softwareen kwamen tot de volgende besluiten:

• OCRopy overtreft Tesseract in termen van nauwkeurigheid(wanneer de software niet werd getraind)• Tesseract is veel sneller dan OCRopy• AlchemyAPI is beter in het extraheren van entiteiten danOpenCalaisUit deze tests waren we ook in staat om de eerste tekortkomingvan ons testapparaat te vinden. De resolutie van de camera isnamelijk te laag zodat OCR correct kan worden uitgevoerd.We bouwden ook een REST webservice dat het OCR- en NTV-proces automatiseert en media opzoekt voor de herkende enti-teiten. Om deze media te tonen aan de gebruiker werd er ookeen prototype applicatie ontwikkeld voor Android.Vervolgens hebben we tests uitgevoerd om de tijd te metenwanneer verschillende netwerktechnologieen werden gebruikt.Hieruit concludeerden we dat draagbare apparaten (en in het al-gemeen mobiele apparaten) niet geschikt zijn wanneer een net-werkverbinding nodig is.Om het probleem van de kleine schermresolutie op te lossen,hebben we gebruik gemaakt van geluidsmeldingen evenals in-gebouwde technieken zoals zoomen en pannen. We bouwdenook een overzicht en gebruikten ingebouwde layout types dieons toelieten om zoveel mogelijk informatie in een keer te to-nen.Dit toonde ons een derde tekortkoming van draagbare appara-ten. Vanwege hun zwakke hardwarespecificatie (in ons gevalhet geheugen) kan een vlot gebruik van de applicatie niet gega-randeerd worden.Al deze tekortkomingen kunnen worden verholpen door het ma-ken van afwegingen: het gebruik van een externe camera ofsmartphone om foto’s te nemen, gebruik van WiFi alleen enminder media tonen om minder geheugen te gebruiken. Kortomkunnen we concluderen dat draagbare apparaten zeker geschiktzijn om de gebruikerservaring te verbeteren, hoewel er afwegin-gen moeten worden gemaakt.

REFERENCES

[1] Ray Smith, “An overview of the Tesseract OCR engine,” in ICDAR, 2007, vol. 7, pp. 629–633.

[2] Thomas M Breuel, “The OCRopus open source OCR system,” in Electronic Imaging 2008. International Society for Optics and Photonics, 2008, pp. 68150F–68150F.

[3] Elizabeth D Liddy, “Natural language processing,” 2001.

[4] P Spyns, “Natural language processing,” Methods of Information in Medicine, vol. 35, no. 4, pp. 285–301, 1996.

[5] Christian Bizer, Tom Heath, and Tim Berners-Lee, “Linked data - the story so far,” International Journal on Semantic Web and Information Systems, vol. 5, no. 3, pp. 1–22, 2009.

[6] Tim Berners-Lee, James Hendler, Ora Lassila, et al., “The semantic web,” 2001.

[7] Joseph Turian, “Using AlchemyAPI for enterprise-grade text analysis,” Tech. Rep., AlchemyAPI, 08 2013.

[8] Marius-Gabriel Butuc, “Semantically enriching content using OpenCalais,” EDITIA, vol. 9, pp. 77–88, 2009.

[9] Minhee Chae and Jinwoo Kim, “Do size and structure matter to mobile users? An empirical study of the effects of screen size, information structure, and task complexity on user activities with standard web phones,” Behaviour & Information Technology, vol. 23, no. 3, pp. 165–181, 2004.

[10] Jacob Eisenstein, Jean Vanderdonckt, and Angel Puerta, “Adapting to mobile contexts with user-interface modeling,” in Mobile Computing Systems and Applications, 2000 Third IEEE Workshop on. IEEE, 2000, pp. 83–92.

[11] Daniel Churchill and John Hedberg, “Learning object design considerations for small-screen handheld devices,” Computers & Education, vol. 50, no. 3, pp. 881–893, 2008.

[12] Stephen A Brewster et al., “Sound in the interface to a mobile computer,” in HCI (2), 1999, pp. 43–47.

[13] George W Furnas, Generalized fisheye views, vol. 17, ACM, 1986.

[14] Jian Liang, David Doermann, and Huiping Li, “Camera-based analysis of text and documents: a survey,” International Journal of Document Analysis and Recognition (IJDAR), vol. 7, no. 2-3, pp. 84–104, 2005.

[15] Luca Chittaro, “Visualizing information on mobile devices,” Computer, vol. 39, no. 3, pp. 40–45, 2006.

Usage permission

“The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the limitations of the copyright have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.”

Arya Ghodsi, Ghent January 2016


Preface

This dissertation has come to existence because of my passion for mobile devices. Developing this dissertation has allowed me to work with cutting-edge technology as well as with natural language processing. This would not have been possible without my supervisors prof. dr. Peter Lambert and prof. dr. ir. Rik Van de Walle.
I also want to express my gratitude towards my counsellors: Ir. Jonas El Sayeh Khalil and Ignace Saenen. They have been with me all the way and have advised me on numerous occasions. Without them, this dissertation would not have been as complete as it is now.
I also want to thank my partner, Naomi, for always supporting me and taking the time to proofread this hefty document.
Furthermore, I would like to thank my close friends Cedric, Joren, Sander and Steven for their time, help, and suggestions. Finally, I want to thank my parents for their never-ending support, which enabled me to pursue my degree.

Arya Ghodsi, Ghent January 2016


List of Figures

1.1 Set of smart glasses

2.1 Usual flow of OCR software
2.2 Algorithm of [EOW10]
2.3 Algorithm by [MGLM12] for text-to-speech
2.4 The head mount presented by [GT09]
2.5 The architecture of RMAP as presented by [SHY10]
2.6 Architecture of the open-source OCR engine Tesseract
2.7 The flow of OCropy software
2.8 The flow of the language model of OCropy software
2.9 An example of hOCR output
2.10 Linked Open Data cloud of interconnected datasets
2.11 Example of info box using wiki-markup
2.12 Architecture of DBPedia
2.13 Available tools for semantic reasoning
2.14 The presentation problem as described by [Chi06]
2.15 Solutions to the presentation problem
2.16 Example of decision tree for AIO
2.17 Interface manipulation methods zooming and fisheye
2.18 The marquee menu gesture commands as presented by [BXWM04]
2.19 Example of bullseye menu
2.20 Caveat when using semi-transparent widgets
2.21 Setup of application test with sounds as base
2.22 Virtual bullseye with sounds

3.2 Class diagram of client software
3.3 Client application flow
3.5 An example of lineage information


3.6 Example of map media
3.7 An example of tweet-media

4.1 Font differences between two books
4.2 Two testing scenarios
4.3 One page experiment
4.5 Sample images from Epson Moverio BT-200 wearable
4.6 Image difference between Epson Moverio and generic smartphone
4.7 Webservice call for testing OCR
4.8 The accuracy of the OCR-software for pictures taken with the smartphone camera
4.9 The accuracy of the OCR-software for pictures taken with the Moverio BT-200 (note: Tesseract has accuracy 0%)
4.10 The average performance of OCR software depending on the device the pictures were taken with
4.11 The average performance of OCROpus vs. Tesseract
4.12 A call to the NLP-test case web service
4.13 The accuracy of the NLP services with the ground truth text
4.14 The accuracy of the NLP services with text from the OCR-output
4.15 Average performance NLP services (independent from text)
4.16 Average accuracy NLP service (based on text)
4.17 Client test between Edge and WiFi
4.19 GIF gallery comparison between smartphone and Epson Moverio
4.20 Battery consumption over a 30 minute time period

5.1 Lineage information for Sam without dynamic time line
5.2 Lineage information for Sam with dynamic time line


List of Tables

1.1 Characteristics of three wearable smart glasses

2.1 Overview of OCropy's layout analysis methods
2.2 Text-line recognition tools in OCropy

3.1 Experimentation hardware setup
3.2 Used software solutions
3.3 Possible fields of the HTML-form sent from the client
3.4 Command-line options for OCR scripts
3.5 Image properties of Giphy

4.1 Overview of test scenarios
4.2 Different light settings of the tests
4.3 OCR accuracy results for the generic smartphone camera
4.4 OCR accuracy results of Epson Moverio BT-200
4.5 Possible keys for the ground truth files
4.6 The number of entities per ground-truth scenario
4.7 The accuracy results for the NLP services for the ground truth texts
4.8 The accuracy results for the NLP services for the OCR texts


Contents

Abstract

Usage permission

Preface

List of Figures

List of Tables

1 Introduction
1.1 The OCR Challenge
1.2 The Semantic Challenge
1.3 The Real Estate Challenge
1.4 Other Challenges
1.5 Dissertation outline

2 Related work
2.1 OCR
2.1.1 Text detection, localization and segmentation
2.1.2 Text recognition technologies for the visually impaired
2.1.3 OCR software
2.2 Semantic Reasoning
2.2.1 Language Identification
2.2.2 Entailment
2.2.3 Semantic Web
2.2.4 Available tools
2.3 Real Estate
2.3.1 Design principles


2.3.2 Interface manipulation
2.3.3 Customized widgets
2.3.4 Sonically enhanced systems

3 Setup
3.1 REST-webservice
3.2 OCR-software
3.3 NLP-software
3.3.1 Alchemy API
3.3.2 OpenCalais
3.4 Client software
3.4.1 Client architecture
3.4.2 Fetching and processing webservice response
3.4.3 Possible media

4 Experiments
4.1 OCR software
4.1.1 Results
4.2 NLP software
4.2.1 Results
4.3 Client software
4.3.1 Results

5 Future work
5.1 Dynamic timeline

6 Conclusion

References

Appendices

A Setup WSGI with Apache

B Shell scripts


1 Introduction

With wearable computers becoming more prominent in everyday life, one must consider how these devices can be used for enhancing everyday life and how they can become more than just a gadget. These devices have a lot of potential based on how people operate them. They do not have to reach for their pockets or purse. One swift gesture and they can receive a lot of information at once. Within the category of wearable computers we can distinguish several categories such as smart watches [Joh14][YMOF14], devices worn on the head of the user [LLV+13][YMOF14] or wearable devices that are designed as activity tracking devices [SHM+14]. In this dissertation we will focus on devices such as the Google Glass1, ReconJET glasses2 and Epson Moverio3. The aforementioned devices can be seen in Figure 1.1. We have listed their important characteristics for this dissertation in Table 1.1.

These devices enable developers to unleash the full power of informatics because they are able to provide a lot of data and are able to provide feedback information in an augmented manner. This is the first problem that needs to be tackled. We need to investigate the data capturing and reception capabilities, and how these data are to be processed for specific applications. Data can, for example, be used to deduce the user's activity and enhance that experience. In this dissertation we will pursue the use case in which we try to enhance the user experience while he is reading a (paper) book. The device (assisted by a companion device/service for high-performance computing tasks) simultaneously records what is read, and processes and interprets this data. This data can then be used to find additional information

1 http://www.google.es/glass/start/
2 http://www.reconinstruments.com/products/jet/
3 http://www.epson.com/cgi-bin/Store/jsp/Landing/moverio_developer-program.do


(a) The Epson Moverio BT-200 smart glasses.

(b) The highly commercialized Google Glass.

(c) The Recon Jet smart glasses.

Figure 1.1: Set of smart glasses.

                                      Google Glass   Recon Jet     Epson Moverio BT-200
Camera (in MP)                        5              720p camera   0.3
Display resolution (pixel dimension)  640 x 360      16:9 WQVGA    960 x 540
Battery (in mAh)                      570            UNKNOWN       2720

Table 1.1: Three wearable smart glasses with their important characteristics. Note that the Recon Jet is, at the moment of writing, not yet available; the data shown here is as advertised on the manufacturer's website.

such as definitions or translations of words, locations, a photograph of a word that is recognised as an object, or trailers of film adaptations, etc. More generally, we will try to return objects of interest to the user.

We believe that this use case presents a viable application of these devices. Books are still one of the preferred media for leisure and for passing on information [Bra14, WDB10]. Yet books can be very complex: they can deal with difficult scientific matter or have a plot that is hard to follow, with many characters and locations. In these cases a wearable device can offer a long-term solution.

In what follows we will give an overview of what we will be resolving in this dissertation, using the use case described above.

1.1 The OCR Challenge

When trying to provide the user with additional media, we first need to recognize which book he is reading. For this purpose we can make use of freely available OCR4-software. OCR is the act of processing images and videos in order to detect

4 Optical Character Recognition


text and translate these into machine-readable content [LDL05]. However, using OCR software already poses a first challenge. In general, wearable devices are not equipped with high-specification hardware because of economic and ergonomic reasons. This also applies to the cameras that are usually found on these devices (see the first row of Table 1.1). As we will see in the following chapters, OCR software has a set of certain minimum requirements in order to generate machine-readable output. Another issue that might be a challenge is the sitting posture of the reader; this can introduce extra undesirable factors, such as skew, when trying to obtain a good OCR representation. This is the first problem that has to be resolved: how can we achieve good optical character recognition given these factors?

1.2 The Semantic Challenge

Once we have solved the first problem, we will have a machine-readable text format. This is however not sufficient: we need to discover which book the user is reading. Failing to discover this might render the application useless. To this end, we have to explore the means available to deduce the semantic meaning of the text we just uncovered. Though this step seems trivial, it will prove to be a difficult task because of the multifaceted problem that semantic reasoning is. We do not only need to understand what the user is reading, we also need to understand which extra content can be linked to what he is reading, in order to be able to provide him with valuable content.

1.3 The Real Estate Challenge

In the last stage we have to face yet another major difficulty which comes with the hardware specification of the devices we are looking into. Most of these devices have a very limited display size (see the second row of Table 1.1), which leads to the following question: given a small display, how will we be able to show the content we discovered in the previous stage in such a way that it allows maximum usage of the display without disturbing the user in his reading experience? We dub this problem the Real Estate problem: when we are given a certain amount of land (= display resolution), how can we make full use of this land to build real estate (= generated content)? There are several existing methods to solve this problem; we will investigate which method yields the best results.


1.4 Other Challenges

Lastly, there are also other minor challenges that need to be addressed. Due to the nature of wearable devices we need to look at several other issues that may arise, such as, but not limited to:

Battery Life These devices are meant to be worn and hence they provide little battery capacity (see the last row of Table 1.1 for specific capacities). We need to make important decisions for the sake of conserving battery power, as the user will punish the application for being too hard on battery life.

Privacy Once an application is aware of which content the user is reading, this might pose several privacy issues. Therefore, we are obliged to look at the social aspect of the application.

Although these challenges are not the main topic of this dissertation, we will pay attention to them while pursuing our research. Given the challenges of wearable devices, we want to formulate an answer to the following question: “How capable are wearable devices of enhancing the user experience using context data?”. We have selected a use case in which the Lord of the Rings books are used as an example.

1.5 Dissertation outline

In the following chapter we will examine the available literature on the previously mentioned challenges. In Chapter 3 we will discuss the setup and implementation of our solutions to tackle these problems. Chapter 4 will elaborate on the experiments that were conducted and discuss their results. We will describe possible future work in Chapter 5. Lastly, we will conclude this dissertation in Chapter 6.


2 Related work

This chapter will describe the existing literature concerning the three main areas (OCR, semantic reasoning and real estate) of this dissertation. We will start by describing how current OCR software works. Secondly, the reader will be introduced to semantic reasoning, and we will finish by describing literature dealing with the small display resolution of the category of devices we are investigating in this dissertation.

2.1 OCR

Every major OCR technology that is on the market right now follows roughly the same model. Given an image as input, the OCR software will do the following:

• Preprocess image

• Analyze layout

• Recognize text lines and words in image

• Recognize words in linguistic terms

• Output machine-understandable text

An overview of these steps can be found in Figure 2.1. We will discuss the current OCR software in more depth in Section 2.1.3.
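To make this flow concrete, the following minimal sketch strings the stages together using the Pillow and pytesseract packages; the file name is hypothetical and this is only an illustration, not the pipeline built later in this dissertation.

# Illustrative sketch of the generic OCR flow above using Pillow and
# pytesseract; "page.jpg" is a hypothetical input image.
from PIL import Image, ImageOps
import pytesseract

# 1. Preprocess image: grayscale and a simple global binarization.
image = Image.open("page.jpg")
gray = ImageOps.grayscale(image)
binary = gray.point(lambda p: 255 if p > 128 else 0)

# 2-4. Layout analysis, line/word finding and linguistic recognition are
#      handled internally by the OCR engine.
# 5. Output machine-understandable text.
print(pytesseract.image_to_string(binary, lang="eng"))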


Figure 2.1: Usual flow of OCR software.

2.1.1 Text detection, localization and segmentation

In order to correctly identify what the user is reading, the first step is to detect text in the images we captured from our wearable device. In [LW02] text lines are identified using a complex-valued multilayer feed-forward network that is trained to detect text at a fixed scale and position. This work differs from others in such a way that the method can not only handle video, but it also processes complex images that contain text on websites, for example. This output can also be fed to traditional OCR software for further inspection. Moreover, a top-down approach is used to allow text localization and text segmentation. This approach allowed the authors to build a truly multi-resolution algorithm which can handle small images as well as large ones. They can handle videos that range from MPEG-1 to HDTV MPEG-2. First, the algorithm tries to locate potential text lines in images, web pages or video frames. In the case of video the algorithm exploits temporal redundancy, i.e. the algorithm exploits the fact that each text line appears over several neighbouring frames. This will not only increase the chances of text detection and localization, but it will also enhance text segmentation as well as remove false text alarms in individual frames, since they are usually not stable throughout time. Using a difficult, real-world test set of video frames, with tens of hours of home videos, newscasts and title sequences of feature films, [LW02] reports a


69.5% success rate of detecting text boxes correctly. This number however rises up to 94.7% when temporal redundancy is exploited. They were able to correctly segment 79.6% of all characters and 88% of them were correctly recognized.

A hybrid algorithm for text detection and text tracking is presented in [GK]. This work attempts to present real-time scene text detection for low-resource wearable devices. Their method combines two separate modules which are both based on MSER (Maximally Stable Extremal Regions). This allows the modules to be combined in harmony, which improves the robustness and boosts the speed of the system. The most interesting part of this method is that it can also deal with rotation, translation, scaling and perspective deformations of detected text. The main idea of the presented method is that even with a slow text detection method, it is still possible to achieve real-time performance. This can be achieved by running the detection algorithm periodically. In the meanwhile, a fast tracking module handles the propagation of the previous detection for the frames that are not processed by the detection module. In the first step of the algorithm an MSER is extracted from a given still frame in order to find text-region candidates. This produces pure text-only groups with almost all text parts clustered together, after which two Real Adaboost1 classifiers are used to filter non-text groups from text groups. In the last step some simple post-processing steps are performed in order to obtain text-line-level bounding boxes. In the second step of the tracking module of [GK], a component tree is used to solve the correspondence problem, i.e. tracking single MSERs in successive frames is a correspondence problem within a window surrounding their previous location. Furthermore, [GK] notes that the search process is done within all the regions of the component tree, which allows finding correct matches even if the target region is for example blurred (and therefore has lost its stability criteria). The tracking module uses invariant moments as features to find correspondence, and it also considers groups of regions, i.e. text lines, instead of one single MSER. This enables detecting mismatches using RANSAC2 when there is no fit with an underlying line model. [GK] reports that the method is able to detect and track text at an average of 15 frames per second (fps) on a low-resource Android tablet with a 1.5 GHz quad-core processor. This performance is due to the fast performance of the text tracking module, which handles the tracking in a negligible 40 ms on average.

[EOW10] puts forth a new local image operator which tries to find the value of the stroke width for each image pixel. Due to its local character, the so-called Stroke Width Transform (SWT) operator is fast and robust. It allows detecting

1 Adaboost (Adaptive Boosting) is a machine learning meta-algorithm. It can be combined with other learning algorithms in order to improve performance.

2 “RANSAC (Random sample consensus) is an iterative method to estimate parameters of a mathematical model from a set of observed data which contains outliers”


text in natural scenes in many fonts and languages. The SWT operator transforms the image data from containing colour values to containing the most likely stroke width. After processing, the system is able to detect text regardless of scale, font and language. SWT uses the constant stroke width feature of text to distinguish text from scenery. The main idea behind [EOW10] is the computation of the stroke width for each pixel. This idea is unique because it allows smart grouping of pixels instead of looking for a separating feature per pixel, like gradient or colour, nor does it utilize any language-specific filtering mechanisms, such as an OCR filtering stage, as discussed later.
The whole algorithm uses a bottom-up integration and merges pixels with a similar stroke width into connected components. An overview of the steps of this algorithm can be seen in Figure 2.2.
The algorithm works as follows: first the edges are computed using the Canny edge detector, after which the SWT transformation is performed. The output of the SWT operator is an image with the same size as the image that is fed to the operator as input. In this image each element contains the width of the stroke, i.e. a contiguous part of an image that forms a band of nearly constant width, associated with the pixel.
After this step, the algorithm tries to find letter candidates. It does so using the SWT output and groups pixels that most likely belong to the same stroke, i.e. two neighbouring pixels may be grouped if they have a similar stroke width. Then, using simple rules, potential components that contain text are transformed into letter candidates. The next step consists of filtering groups of letters, because single letters do not usually appear in images. This filtering allows the deletion of randomly scattered noise. Using the suggestion that text appears in a linear form, the algorithm then assumes that text on a line will have a similar stroke width, letter width and letter height, and it uses this knowledge to define candidate pairs of letters. The candidate pairs from the previous steps are then clustered together into chains. Two chains can be merged if they share one end and have a similar direction. This process continues until no more chains can be merged. The produced chains that are long enough are considered to be a text line. In the final step, text lines are broken into separate words using a heuristic.
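The following sketch illustrates only the core intuition of grouping pixels by a roughly constant stroke width, using a distance transform as a crude stand-in for the SWT operator. It is our own simplified illustration (with a hypothetical file name and threshold), not the algorithm of [EOW10].

# Greatly simplified approximation of the stroke-width idea behind [EOW10];
# it is NOT the full SWT operator (no per-ray gradient tracing, no chains).
import cv2
import numpy as np

img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
# Binarize: text becomes the foreground (white) for the distance transform.
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Distance to the nearest background pixel; twice that value is a rough
# local stroke-width estimate along the stroke's centre line.
dist = cv2.distanceTransform(binary, cv2.DIST_L2, 5)

# Group pixels into connected components and keep those whose stroke-width
# estimate is nearly constant, mimicking SWT's "constant stroke width" cue.
num, labels = cv2.connectedComponents(binary)
letter_candidates = []
for label in range(1, num):
    widths = 2.0 * dist[labels == label]
    if widths.size and np.std(widths) < 0.5 * np.mean(widths):  # hypothetical rule
        letter_candidates.append(label)

print(f"{len(letter_candidates)} letter candidates out of {num - 1} components")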


Figure 2.2: The steps of the algorithm described by [EOW10]

Using the ICDAR3 test set of images, the f-measure4 is used to compare algorithms in the field of text recognition. The output of each algorithm tested with the f-measure is a set of rectangles which are the bounding boxes for detected texts. This set is called the estimate. ICDAR also provides a data set of ground-truth boxes which is called the targets. The match m_p between two rectangles is defined as the area of the intersection divided by the area of the minimum bounding box containing both rectangles. For each estimated rectangle, the closest match was found in the set of targets, and so the best match of a rectangle r in a set of rectangles R, m(r; R), is defined as:

m(r; R) = max { m_p(r, r′) | r′ ∈ R }

The f-measure consists of two other measures, the precision and the recall.
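As a small illustration of how these measures fit together (our own sketch, not code from [EOW10]), the snippet below uses (x, y, width, height) rectangles and the standard harmonic-mean definition of the f-measure:

# Illustrative sketch of the ICDAR-style match score and the f-measure.
def match_p(a, b):
    """Area of the intersection divided by the area of the minimum
    bounding box containing both rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    union_w = max(ax + aw, bx + bw) - min(ax, bx)
    union_h = max(ay + ah, by + bh) - min(ay, by)
    return (ix * iy) / (union_w * union_h)

def best_match(r, targets):
    """m(r; R) = max over r' in R of match_p(r, r')."""
    return max((match_p(r, t) for t in targets), default=0.0)

def f_measure(precision, recall):
    """Standard harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(0.73, 0.60), 2))   # -> 0.66, as reported for [EOW10] below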

The algorithm presented by [EOW10] has a precision of 0.73 and a recall of 0.60, giving it an f-measure of 0.66 in only 940 ms and ranking it first in the ICDAR 2003 and ICDAR 2005 contests. There are some cases in which the algorithm presented by [EOW10] is not able to detect the text. These cases are: strong highlights, transparent text, text size that is out of bounds, excessive blur and a curved baseline. In the following subsection we will shortly discuss the existing wearable technologies that have text recognition on board and have been developed for the visually impaired.

2.1.2 Text recognition technologies for the visually impaired

The authors of [MGLM12] present a hat which is embedded with a camera, a speaker and a USB port. The camera within this wearable device shoots real-time videos

3 International Conference on Document Analysis and Recognition
4 F-measure is a statistical way of measuring a test's accuracy.


which are then sent via the USB port to a small laptop, which is to be carried around. This method starts with detecting candidate text regions in the image stream that the camera in the wearable provides, using their MSER-based approach. These regions are tracked in consecutive frames and are then fed to an OCR engine, whose output is then forwarded to a TTS engine which synthesizes speech from the given input. This work-flow can be seen in Figure 2.3.
Their MSER-based approach consists of using a linear-time MSER algorithm which is used in combination with hierarchical filtering. The original image is analysed twice, once to find MSER regions by applying the MSER algorithm. This will produce light regions inside dark regions (these are called MSER+ regions). The second time the MSER algorithm is applied on the inverse (or negative) of the original image, which produces dark regions inside light regions (so-called MSER− regions). These two sets of regions are disjoint and can then be used to produce a hierarchical MSER tree. This tree is then pruned and filtered using a simple set of rules. The result of this filtered tree is then, as mentioned before, sent to the OCR engine for recognition. Using the ICDAR 2003 test set, the algorithm had a precision of 0.51 and a recall of 0.71, giving it an f-measure of 0.55 in 2.43 seconds on average. The authors also report a processing speed of 14 fps for 640 × 480 and 9 fps for 800 × 600.

The authors of [GT09] propose a head mount camera system that is connected to a laptop PC running a Linux distribution. The camera is 380k-pixel; the signal of this camera is converted to a DV stream using an NTSC-DV converter. This work proposes to extract text strings using a revised DCT-based method. After this step the text regions are grouped into image chains using a tracking method that is based on particle filters. The steps of the algorithm can be seen in Figure 2.4. The authors note that the system is not equipped with a character recognition process because they solely focused on text detection, tracking and image selection. Images are divided into two groups using discriminant analysis-based thresholding and a high-wide frequency band of the DCT coefficient matrix. These two groups are either text groups, which have feature values that are greater than the automatically found threshold, or non-text groups. Once the text groups are extracted, connected components of the blocks are generated and the bounding box of each connected component is regarded as a text region. Text regions are tracked using a particle filter. This means that particles are scattered around the predicted centre of a text block in the current frame. This prediction is derived from the centre of each text block in the previous frame. The particles are then weighted. Particles that fall outside any text block are given a zero weight. If they fall into a text block, they are given a similarity value that is calculated using the previous and current text block.


Figure 2.3: The method presented by [MGLM12]. Note that the hat in this image is not the actual wearable device.

Figure 2.4: The head mount presented by [GT09].


Figure 2.5: The architecture of RMAP as presented by [SHY10].

Using these weights, regions are merged and text images are selected and filtered. Tests were performed using two scenes, the first of which contained eight signboards while the second contained 19 signboards. A visually impaired person wearing the camera went on his way in these scenes and respectively 1000 and 730 video frames were captured. The algorithm passed in total respectively 66 and 77 images to the character recognition, which is respectively 8.3 and 4 times more than it should normally have passed. Furthermore, the authors report an average processing speed of 9.3 fps.

In [SHY10] a system similar to that of [GT09], called R-MAP, is built. However, instead of using a camera and an accompanying laptop, an Android phone is used. In this system they used the Tesseract OCR engine, which we will discuss in Section 2.1.3. The architecture of R-MAP can be viewed in Figure 2.5. The authors note that due to the usage of a phone camera several artefacts can occur in the images that may deteriorate the OCR output (such as skew, tilt and the light environment). The authors use the ISRI5 metrics to measure the accuracy of R-MAP. These metrics are the character and word accuracy. Given the number of characters n in an image and m the number of errors resulting from the OCR engine, the character accuracy is defined as follows:

5 Information Science Research Institute


Character accuracy = (n − m) / n

The word accuracy on the other hand is defined as:

Word accuracy = m / n

where m and n are respectively the number of words that were recognized by the OCR engine and the total number of words in an image.
The authors used four different test corpora, each of which contained images that were taken with an HTC G1 under different conditions (indoor and outdoor light environment, skew and tilt). The authors report a 96% and 100% word accuracy respectively in indoor and outdoor well-lit environments. The word accuracy for skewed images is slightly lower, ranging between 85-90% (this is mainly because the OCR engine is able to handle a skew of ±10 degrees). For images that were tilted, i.e. perspective distortion when the text plane is not parallel to the image capturing plane, a word accuracy ranging between 66-81% was reported. The authors note that this can be easily optimized using the orientation sensors that an average phone provides. Overall, R-MAP also has a very low false positive rate (2%, 11% and 5% for respectively the indoor and outdoor light environment, skewed images and tilted images). Furthermore, these accuracy rates can be improved by training the OCR engine, given that it is known to benefit from the use of an adaptive classifier, as discussed below.
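A minimal sketch of these two ISRI-style metrics follows (our own illustration; the sample numbers are hypothetical).

# Illustrative implementation of the character and word accuracy defined above.
def character_accuracy(n_chars: int, n_errors: int) -> float:
    """(n - m) / n, with n the number of characters and m the OCR errors."""
    return (n_chars - n_errors) / n_chars

def word_accuracy(n_recognized: int, n_words: int) -> float:
    """m / n, with m the recognized words and n the total number of words."""
    return n_recognized / n_words

print(character_accuracy(1200, 48))   # 0.96 -> 96% character accuracy
print(word_accuracy(240, 250))        # 0.96 -> 96% word accuracy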

2.1.3 OCR software

Before we dive into the current software that is available for OCR, we will discuss in more depth the problems that may occur when using OCR software on wearable devices [EOW10][GT09]. The research area makes a distinction between graphic text, which can be found on top of images (e.g. subtitles), and scene text (e.g. name tags) when processing documents using OCR. When tackling the OCR problem, devices like scanners are used, which can have a resolution as high as 2400 dpi6. Consumer cameras are expanding to 8 MP; with a resolution of 3500 × 2200 this is more than enough for OCR, which requires around 300 dpi. When it comes to video, it has been shown that a resolution up to 640 × 480 is sufficient for handling scene text [LDL05].
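As a quick sanity check (our own arithmetic, not taken from [LDL05]), a 3500 × 2200 sensor at 300 dpi covers roughly

3500 / 300 ≈ 11.7 inch  by  2200 / 300 ≈ 7.3 inch,

which is about the size of an A4 page (11.69 × 8.27 inch), so a camera in the 8 MP range can indeed capture a full page at the resolution OCR engines prefer.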

Although wearable cameras have not been designed for document processing and OCR in general, they do have some advantages compared to scanners. They are small, easy to carry and can be used in any environment. They can also be used for books and newspapers that are difficult to scan; e.g. a study has shown that PC-cams are more productive than scanner-based OCR for extracting text from newspapers. The latter study also found that digital cameras

6 dots per inch


are capable of capturing a whole A4-size document at 200 dpi [LDL05]. The authors of [LDL05] then continue to describe some of the challenges that may arise when using wearable cameras. We will shortly describe them here:

Low resolution OCR engines usually need a resolution between 150-400 dpi to work properly; however, images that are produced with wearable technology do not have such high resolutions. Their resolution can even drop under 50 dpi, which makes segmentation very difficult.

Uneven lighting Scanners have very good control over the lighting conditions; wearable cameras do not have this amount of control. Moreover, there are additional complications such as artificial light.

Perspective distortion When the text plane does not appear parallel with the image plane, this will have the effect that characters farther away look smaller. Small and mild distortion can cause huge problems for some OCR engines.

Warping The problem with scene text is that it does not always appear on a plane, e.g. books are almost never flat and are more often curled.

Complex background Some images contain a very complex background, and the absence of a uniform background can make segmentation very difficult.

Zooming and focussing Images taken by a wearable camera mostly have a fixed focus, which can produce blurry edges. Sharp edges, however, are essential for character segmentation and recognition. This can also be a benefit, because it allows zooming or focusing on the area of interest.

Moving objects The nature of wearable devices implies that the device or target may be moving, which may result in blurred images.

Sensor Noise Dark noise and read-out noise are two sources of noise at the CCD/CMOS stage in digital cameras. Additional noise can be generated in amplifiers.

Compression Almost all images that are captured using wearable cameras have some sort of compression. It is possible to obtain the uncompressed picture at the cost of storage space.

Lightweight algorithms The optimal solution would be to embed document analysis into the devices themselves; however, these devices are usually resource-poor, which requires efficient, lightweight algorithms.


Figure 2.6: Architecture of the open-source OCR engine Tesseract.

Tesseract

Tesseract is an open-source OCR engine developed by HP between 1984 and 1994. Tesseract had a significant lead in accuracy over the commercial engines, but in 1994 the development stopped completely. However, in 2005 HP made Tesseract open-source and Ray Smith continued the development as head developer7 [Smi07].

Tesseract uses a traditional step-by-step pipeline for processing. Some of these steps were peculiar back in the days in which Tesseract flourished. However, due to its ties with HP, Tesseract never had its own page layout analysis and it relied on a proprietary page layout analysis of HP.
In the first step the outlines of the components are stored using connected component analysis. Although this step was computationally expensive, it had the advantage that, by inspecting the nesting of outlines and the number of child and grandchild outlines, inverse text is just as easy to detect. At the end of this stage outlines are gathered together, by nesting, hence creating Blobs. In the second stage, Blobs are organized into text lines, and the lines and regions are analysed and the lines are broken into words.
After this step a two-step process follows for recognition. In the first pass the engine tries to recognize each word; each word that is recognized is then passed to an adaptive classifier as training data. This adaptive classifier then tries to recognize words with a higher accuracy. Given that this adaptive classifier may have learned useful elements just a tad too late, a second pass is performed on the page.
Finally the engine tries to resolve fuzzy spaces and checks for alternative hypotheses for the x-height in order to locate small-cap text. This architecture can be found in Figure 2.6 [Smi07].
In the following paragraphs we will describe Tesseract more in depth and we will also discuss how Tesseract's page layout analysis works.

7 The project is now available at https://code.google.com/p/tesseract-ocr/
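Tesseract can be driven from Python through the pytesseract wrapper. The following minimal sketch (with a hypothetical file name, not part of the dissertation's setup) retrieves the recognized words together with the per-word confidence values that the classifier described above assigns.

# Minimal sketch using the pytesseract wrapper around the Tesseract engine;
# "page.png" is a hypothetical input image.
from PIL import Image
import pytesseract

image = Image.open("page.png")

# Plain machine-readable text, i.e. the final output of the pipeline.
text = pytesseract.image_to_string(image, lang="eng")

# Word-level results, including the confidence the classifier assigned.
data = pytesseract.image_to_data(image, lang="eng",
                                 output_type=pytesseract.Output.DICT)
for word, conf in zip(data["text"], data["conf"]):
    if word.strip():
        print(f"{word}\t(confidence {conf})")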


Line and word finding
The line finding algorithm of Tesseract was designed in such a way that it is able to recognize lines on skewed pages without having to de-skew the page, keeping the image quality intact. The key parts of this process are blob filtering and line construction. Assuming that text regions are provided by the page layout analysis, a percentile height filter removes drop-caps and vertically touching characters. The median height provides an approximation of the text size in the given region. Using this median, blobs that are smaller than some fraction of the median are filtered out (these blobs are most likely punctuation, diacritical marks or noise). The filtered blobs are likely to fit a model of non-overlapping, parallel but sloping lines. When the blobs are processed and sorted based on their x-coordinate, it is possible to assign blobs to a unique text line. Once each blob is assigned to a unique text line, a least-median-of-squares fit is used to estimate the baseline, and the filtered-out blobs are fitted back into appropriate lines. In the final step, the line creation step, blobs that overlap are merged together, associating diacritical marks with the correct base; in the same step parts of broken characters are associated together.
Once the text lines have been found, the baselines are fitted more accurately. This is accomplished using quadratic splines, which has the advantage that the calculations are more stable, but it also has the problem that discontinuities could arise when multiple spline segments are required.
After these steps Tesseract will test the text lines to find out whether they are fixed pitch. In the case that they are fixed pitch, Tesseract is able to chop the words into characters using the pitch, and it will disable the chopper and associator for these words in the word recognition step. However, if this is not the case, Tesseract will handle the problems that arise by measuring gaps in a limited vertical range between the baseline and the mean line. Spaces that are too close to the threshold are made fuzzy, so that a final decision can be made after word recognition [Smi07].

Word recognition
The word recognition phase will first classify the output of the line finding, after which the rest of this phase only applies to non-fixed-pitch text. Tesseract will try to improve word recognition results by chopping the blob with the worst confidence from the character classifier as long as the result for a word is


unsatisfactory, as discussed in the following paragraphs. These chops are found using concave vertices of a polygonal approximation. Chops are then executed in priority order and any chop that fails to improve the confidence is reverted. Note that chops are not completely discarded, as they can be reused by the associator. It may happen that even when the potential chops have been exhausted, the word is still not satisfactory. In this case it is passed to the associator. The associator will then perform an A* (best first) search of the segmentation graph. This graph is not actually built; instead, a hash table of visited states is maintained. The graph contains the possible combinations of the maximally chopped blobs into candidate characters. After this, new candidate states will be evaluated by classifying unclassified combinations of fragments [Smi07].

Linguistic analysis and adaptive classifier
Tesseract does not have many linguistic analysis tools on board. When the word recognition module is processing new segments, the linguistic module would choose the best available word from one of the following categories:

• Top frequent word

• Top dictionary word

• Top numeric word

• Top upper case word

• Top lower case word

• Top classifier choice

The final decision for a given segmentation is the word with the lowest distance rating. The adaptive classifier is not a template classifier, but it uses the same elements as Tesseract's built-in static classifier. The difference between the two classifiers is that the adaptive classifier uses isotropic baseline/x-height normalization, whereas the static one uses the first and second moments for normalization. The approach used by the adaptive classifier makes it easier to make a distinction between upper and lower case.


Page layout analysis
As we mentioned earlier, Tesseract used page layout analysis software by HP in the beginning. After it broke ties with HP, the software thus missed a key component. This was the case until [Smi09] proposed a hybrid page layout analysis that utilizes bottom-up methods to form a data-type hypothesis and locate tab-stops. Tab-stops are used to format a page; being able to detect tab-stops allows one to deduce the column layout of a page. The column layout in turn is applied in a top-down way to detect the structure and reading order of the analysed section. In the preprocessing step the algorithm tries to detect line separators and image regions, and at the same time it separates connected components into text components and components for which the type is not certain. In the subsequent steps the algorithm tries to find the column layout based on a set of rules using column partitions (these are created by scanning the connected components that were detected in the previous steps). Lastly, the column partitions are used to find text regions and to determine the reading order.
In the previous paragraphs we have discussed the open-source OCR engine Tesseract in depth. Concerning the evaluation of Tesseract's accuracy, [Smi07] notes that due to its long dormant state Tesseract now ranks lower than commercial OCR engines. However, the authors of [HKP] compared Tesseract to a commercial OCR engine (ABBYY FineReader) and found that Tesseract performs much better in terms of word accuracy than character accuracy.

OCRopy

In this subsection we will discuss another OCR technology that is widely used: OCropy. OCropy puts the emphasis on modularity, easy extensibility and reuse. In [Bre08] the author goes on to discuss the main properties of the OCR software. We will discuss them shortly in the following paragraphs. An overview of the total process of OCropy is given in Figure 2.7.

Architecture
The architecture of OCropy is actually almost the same as in Figure 2.1. The architecture is built in such a way that there is no backtracking: the software moves strictly forward and consists of three major steps:

(Physical) layout analysis In this step the software tries to identify text columns, text blocks, text lines and the reading order.

Text line recognition This step is responsible for recognizing the text in a given line. OCropy is designed in such a way that it is able to recognize vertical and right-to-left text. Furthermore, a hypothesis graph is constructed showing possible alternative recognitions.

Statistical language modelling The last step integrates the alternative recognitions from the previous step using prior knowledge about the language, text domain, vocabulary, etc.

Figure 2.7: The flow of OCropy software. Figure taken from [Bre08].
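Before we discuss each of these steps, the following minimal sketch shows how such a three-stage run is typically driven from a script. The ocropus-* command names follow the OCropy 0.x distribution, but the exact flags may differ between versions and the paths are hypothetical.

# Sketch (not the dissertation's actual scripts) of driving the OCropy stages.
import subprocess

def run(cmd):
    print("$", cmd)
    subprocess.run(cmd, shell=True, check=True)

run("ocropus-nlbin page.png -o book")                  # preprocessing / binarization
run("ocropus-gpageseg 'book/????.bin.png'")            # layout analysis into text lines
run("ocropus-rpred -n 'book/????/??????.bin.png'")     # text line recognition
run("ocropus-hocr 'book/????.bin.png' -o book.html")   # hOCR output of the result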

We will now shortly discuss each of these steps.

Layout analysis
We note that prior to this step there is also a preprocessing and clean-up step, in which OCropy provides the standard tools such as gray-scale and binary image processing. The goal of layout analysis is to take a raw input image and then divide this into text and non-text components. Although there is no restriction on the direction of the text, the layout analysis module must indicate the correct reading order for the collection of text lines. OCropy contains two layout analysis tools; we have summarized the steps of each tool in Table 2.1. The first method (Text-Image segmentation) is trivial, but RAST-based layout analysis requires a bit of clarification. In the first step of this method, a maximal rectangle algorithm is used to find vertical whitespace rectangles. Among these rectangles the algorithm selects the ones that are near character-sized components on the left or right, i.e. the rectangles represent column boundaries. The method then continues to find text lines by matching the bounding boxes of the source document against a geometrically precise text line model. Lastly, the reading order is determined by comparing pairs of lines, given that the reading order can be defined unambiguously for certain pairs of text lines.

         Text-Image segmentation                     RAST-Based Layout Analysis
Step 1   Divide input image into candidate regions   Column finding
Step 2   Features extracted from regions             Text line modelling
Step 3   Each region is classified                   Reading order determination

Table 2.1: The two available OCropy layout analysis methods and the steps taken by the methods to achieve results.


Text line recognition
In this step the software translates the page into a collection of text lines. These lines are then passed to a text line recognizer for the input language. Again, two methods are present in the OCropy framework. We summarize some of their characteristics in Table 2.2. As we have discussed Tesseract in the previous subsection, we will not go into detail about this technology. However, we must note that OCropy does not use Tesseract by default as of version 0.4. The developers have established prototype recognizers based on Hidden Markov Models [Bre09]. The discussion of the second method is outside the scope of this dissertation, which is why we invite the reader to consult [Bre08] for a more in-depth discussion.

         Tesseract                                     MLP-based recognition
Step 1   Comparison between prototype character        Multi-layer perceptrons (MLPs)
         shapes and characters in the input            for character recognition
Step 2   Mature OCR system                             Over-segmentation of input string
Step 3   Character recognition error close to          Recognizing potential character
         commercial systems                            hypotheses
Step 4   Does not estimate character likelihoods       Expressing recognition results and
                                                       geometric relationships as a graph

Table 2.2: Text line recognition methods available in OCropy and their main characteristics.

Language modelling
OCropy uses statistical language modelling, in which dictionaries, character- and word-level n-grams8 and stochastic grammars are used in order to resolve ambiguity and missing characters. To this end the module assigns probabilities to strings and matches them with the most likely interpretations. This module is completely outsourced to an open-source library, OpenFST, which uses weighted finite state transducers; the workings of this library are out of the scope of this dissertation as well. A schematic representation of how this module works is shown in Figure 2.8. The author of [Bre09] notes that the library OpenFST does not provide efficient decoding of input. Hence, in more recent versions of OCropy there is an in-house

8 An n-gram is a contiguous sequence of n items from a given sequence of text or speech.


Figure 2.8: The flow of the language model of OCropy software.

developed A* search algorithm for decoding.

Other characteristics
Before we conclude this section, there are two more characteristics of OCropy we would like to mention. The first one is that OCropy, unlike other OCR software, is built in such a way as to recognize whole books. To this purpose, OCropy supports an intermediate representation that contains the results of the whole process (from preprocessing until language modelling) for a whole book. This approach has a number of advantages: it allows estimation of fonts, resolutions, page size and other parameters for an entire book. These estimations can then in turn be used to train the software for better recognition.
Lastly, we mention the output format of OCropy, namely hOCR. Instead of using a proprietary format, OCropy uses a format that is based on HTML. These HTML pages are then augmented with OCR-specific HTML tags. This has the advantage that HTML can show all major scripts and languages. An example of this output can be seen in Figure 2.9.

In this section we have discussed the way that current OCR software works and we have given a detailed overview of two major OCR software applications on the market. We would also like to note that there is other open-source OCR software available, such as Ocrad and GOCR.


Figure 2.9: An example of hOCR output from the OCropy software. Figure taken from [Bre08].

2.2 Semantic Reasoning

This subject can also be seen as a part of the Natural Language Processing (NLP) area, which is discussed more in depth in scientific research. In NLP one tries to analyse text and establish human-like language processing using machines, i.e. computers. This happens using a range of (linguistic) theories as well as machine-involving technologies in order to achieve human-like language processing. The latter term means that a machine is able to do the following [Lid01, Spy96]:

• Summarize a given text.

• Translate from one language to another.

• Derive answers to questions concerning the context of a given text.

• Make conclusions from a given text.

Although we are not interested in all of the above-mentioned items, the second, third and fourth items are of interest to our research. To be able to translate from one


language to another, the first thing that has to be done is to identify the language of the input text. For the third item we are interested in answers to questions such as “In which book is character X mentioned?” or “Where is character X at the moment?”. The fourth item is useful for dynamic time lines.

The following subsections will discuss the literature in the order in which an algorithm tries to understand the semantics of a given text.

2.2.1 Language Identification

The first step in order to be able to understand the semantics of a given text is to be able to identify the language in which the text is written. This allows a given application to look further within the realm of the identified language for the semantics, and it eliminates confusion about the semantic meaning of a given text. Let us consider the following example: “Mama, die Hose!”. This German exclamation can be translated into “Mama, pants!”. If an application were to identify the language as German from the beginning, then this does not pose any problems. However, if the application were to falsely identify the language as English, then the exclamation makes (even) less sense. This phenomenon is also described as False Friends [CDN02] and it shows the importance of language identification for the further semantic reasoning process.
In general, when trying to correctly identify a language, one needs a collection of text for training; this collection usually needs something on the order of thousands or more characters [XLP09]. In cases where there are a lot of language resources available for training, algorithms have been designed that can achieve an accuracy rate up to 99.8% [CT+94]. The accuracy rate however can drop significantly if the available language training collections are scarce. According to [XLP09] the language identification problem is made more troublesome due to several factors such as:

• Large numbers of languages: The ISO-639-2 standard9, which lists codes for language names, consists of more than 450 languages.

• Unseen languages: One might be handling a document which has an unidentified language.

9 http://www.loc.gov/standards/iso639-2/


• Encoding issues: Some languages (such as Arabic and Mandarin) do not use Roman scripts in their writing system.

• Multilingual documents: Some documents contain more than one language. Although this is not the case in novels, this can be encountered in dictionaries.

Most, if not all, language identification methods use either feature-based models or similarity-based classification and categorization [XLP09]. In [HBB+06] an overview is given of these methods. We will shortly describe them.

Feature-based models Several models (such as [CT+94]) use statistical modelof character co-occurrence, while others have used Bayesian models for char-acter sequence predictions. In the last decade methods were developed whichapplied character n-gram tokenization as basis.

Similarity based classification and categorization Within this method in-formation theoretical measures of document similarities are used to identifylanguages in a multilingual environment.

Detection of the character encoding Some methods try to identify the language of the document by looking at the character encoding that is used within the document.

Kernel methods These are a class of machine learning algorithms for pattern analysis. Support vector machines are one such algorithm that can be applied in this case.

Statistical methods The methods applied here use statistical theories (such as Monte Carlo based sampling and confidence limits) to identify the language of a given document.
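To make the feature-based, character n-gram approach of [CT+94] more concrete, the following minimal Python sketch builds ranked n-gram profiles and compares them with the out-of-place measure. The training snippets, profile size and function names are illustrative assumptions; as noted above, reliable identification requires training collections of thousands of characters.

from collections import Counter

def char_ngrams(text, n=3):
    # overlapping character n-grams, padded with spaces
    text = " " + text.lower() + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def build_profile(training_text, n=3, top_k=300):
    # rank the most frequent n-grams of a language sample
    return [g for g, _ in char_ngrams(training_text, n).most_common(top_k)]

def out_of_place(doc_profile, lang_profile):
    # sum of rank differences; n-grams missing from the language profile get the maximum penalty
    rank = {g: i for i, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))

def identify(text, profiles):
    # pick the language whose profile is closest to the document profile
    doc = build_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# toy usage; real profiles are trained on much larger samples
profiles = {
    "en": build_profile("the quick brown fox jumps over the lazy dog and runs away"),
    "de": build_profile("der schnelle braune fuchs springt ueber den faulen hund und rennt weg"),
}
print(identify("Mama, die Hose!", profiles))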

This dissertation will focus on the English language for reasons which we will clarify later on. However, we will use a language identification tool to try to detect the language of the document that the user is reading, in order to provide the correct enhancing content. Once we are able to correctly identify the language of the documents, the next step in NLP is usually entailment.

2.2.2 Entailment

Semantic entailment, and its closely related concept co-referencing, both try to understand the semantic structure of a given text. Semantic entailment tries to deduce whether the meaning of one given sentence entails another given sentence. For example, the sentence “The president was fatally shot in the head.” entails the sentence “The president is dead.” [DGM06]. Co-referencing, however, tries to identify noun phrases (or mentions) that refer to the same entity; for example, in the sentence “Barack Obama was fatally shot in the head. The president was brought to the hospital.” the words “Barack Obama” and “President” are co-referents [BGST10].

Semantic entailment tasks have generally been approached using first-order theorem proving and model generation. One notable shortcoming of these methods is that they are only able to provide a yes/no answer, which in some cases is not sufficient. The approach in [MdR01] describes a way to get a more fuzzy answer to the question of entailment. To this purpose the authors introduce the concept of an entailment score. Given two documents d and d', the entailment score entscore(s_{i,d}, s_{j,d'}) of two text segments s_{i,d} (in d) and s_{j,d'} (in d') is defined as follows:

entscore(s_{i,d}, s_{j,d'}) = \frac{\sum_{t_k \in (s_{i,d} \cap s_{j,d'})} idf_k}{\sum_{t_k \in s_{j,d'}} idf_k}

In this equation the term idf_k, also referred to as the inverse document frequency (IDF) of the term t_k, is defined as follows:

idf_k = \log\left(\frac{N}{n_k}\right)

The terms N and n_k in the above equation stand respectively for the total number of segments in the documents and the number of segments in which the term t_k occurs. It is clear that entscore(s_{i,d}, s_{j,d'}) is a number between 0 and 1 and that it provides a notion of approximate entailment. The authors note that a non-zero entscore(s_{i,d}, s_{j,d'}) does not imply text similarity, nor is it sufficient on its own. One should define an entailment threshold to obtain a correct judgement about the entailment of two given documents.

Although we are not very much interested in extracting entailments from what the user is reading, we can use the last equation to pinpoint important terms in the text he is reading. The IDF function assigns a low number to terms that occur in many segments (such as the, some, a, I, etc.) while content-bearing terms receive a high score. Intuitively, the terms with a higher IDF score are better suited to extract unique content within a document [MdR01].
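As an illustration, the following Python sketch computes the IDF weights and the entailment score exactly as defined above; the segmentation into whitespace-separated, lower-cased terms and the example segments are simplifying assumptions.

import math

def idf_scores(segments):
    # idf_k = log(N / n_k): N segments in total, n_k segments containing term t_k
    N = len(segments)
    df = {}
    for seg in segments:
        for term in set(seg.lower().split()):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(N / float(n)) for term, n in df.items()}

def entscore(s_i, s_j, idf):
    # approximate entailment score of segment s_i (in d) with respect to s_j (in d')
    terms_i, terms_j = set(s_i.lower().split()), set(s_j.lower().split())
    denominator = sum(idf.get(t, 0.0) for t in terms_j)
    if denominator == 0.0:
        return 0.0
    return sum(idf.get(t, 0.0) for t in terms_i & terms_j) / denominator

segments = ["the president was fatally shot in the head",
            "the president is dead",
            "the wizard travelled to the city"]
idf = idf_scores(segments)
print(sorted(idf, key=idf.get, reverse=True)[:3])  # content-bearing terms rank highest
print(entscore(segments[0], segments[1], idf))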


Using a human judge, the authors of [MdR01] were able to correctly identify entailments in 30 to 40% of the cases in under 60 seconds. Although this does not seem impressive, it must be noted that these results were obtained using a 600 MHz Pentium III PC and that entailment problems are normally very computationally expensive [MdR01].

Both semantic entailment and co-referencing use WordNet. WordNet is an on-line lexical collection of English verbs, nouns, adjectives and adverbs. These categories of words are linked to sets of synonyms that in turn are linked to word definitions using semantic relation links. At the moment of writing WordNet has more than 206,000 Word Form and Sense Pairs10. A Form is a string of ASCII characters and a Sense is represented by the set of synonyms that have that sense [Mil95]. WordNet is widely used to deduce the semantic meaning of words or to enrich already existing semantic definitions [AAHM00, BGST10]. However, it also has some shortcomings. For instance, relational theories of lexical semantics point out that any word can be defined using other terms to which it is related, but WordNet does not encode enough semantics to support such definitions [Mil95]. WordNet provides a user interface that allows users to perform searches, but it also allows the glossary to be downloaded and used via the command line11 [Mil95].
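One common way to query WordNet programmatically, not used in this dissertation but convenient for experimentation, is the NLTK interface; the word "wizard" here is only an illustrative input and the corpus must be downloaded once via nltk.download("wordnet").

from nltk.corpus import wordnet as wn

# list the synonym sets (Senses) and definitions WordNet knows for a word
for synset in wn.synsets("wizard"):
    print(synset.name(), "-", synset.definition())
    print("  synonyms:", [lemma.name() for lemma in synset.lemmas()])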

At this point, we have discussed the ability to identify the language of a document, how one can filter out the semantic noise in a document and how to match the semantically significant terms to WordNet. In the next subsection we will briefly describe the literature concerning the Semantic Web, which can be used to deduce knowledge about the extracted words.

2.2.3 Semantic Web

In this subsection we will discuss the current literature about the Semantic Web and, by extension, Linked Data. Linked Data is a layer on top of the normal World Wide Web. It allows the creation of typed links between data from different sources. In contrast to Web 2.0 mash-ups, which use fixed data sources, Linked Data operates on a global, unbound data space [BHBL09].

The Semantic Web is built upon principles that Berners-Lee gave as guidelines in [BLHL+01].

10http://wordnet.princeton.edu/wordnet/man/wnstats.7WN.html
11http://wordnet.princeton.edu/wordnet/man/wn.1WN.html


These rules were meant to be used as guidelines when publishing data so that the published data would become part of one single global data space. The given guidelines, also known as the Linked Data principles, are [BHBL09, BLHL+01]:

• Use URIs to identify elements.

• Use HTTP URIs to elevate the availability of those elements.

• Use standards (such as RDF and SPARQL (see later)) to provide useful information when people look up elements.

• Use links to interconnect elements, enabling the user to find out more about an element if he wishes to.

The Linking Open Data Project12 is a grass-roots community which tries to bootstrap the Web of Data. It does so by identifying already existing data sets which are available under open licenses and converting them to RDF according to the above mentioned principles [BHBL09]. Figure 2.10 shows an impressive collection of data sets; according to statistics gathered by this project, the Web of Data consists of 19.5 billion RDF triples.

As can be seen in Figure 2.10, DBpedia is the largest data set and contributes more than 3 billion RDF triples (17% of the total Web of Data). We will describe DBpedia and its most important characteristics in the next section.

12http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData


Figure 2.10: Linked Open Data cloud that shows the data sets that have already been published and interlinked. There are 570 data sets interconnected by 2909 link sets. An arc represents a connection between two datasets; the darker the arc, the more links it has to the other dataset.

DBpedia

DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. Combining the world's structured information and gathering knowledge in order to answer semantically rich queries is one of the key areas of computer science [YBE+12]. Many have approached this challenge, which has led to the birth of the Semantic Web and related technologies. These efforts, however, have never gained enough traction with the general public [ABK+07].

Traditionally, ontologies and schemas are designed in a top-down manner, i.e. the semantics are developed before the data. This introduces challenges because both data and meta-data must continuously evolve. That is why recent research has started to move towards a grass-roots-style Semantic Web.


Such methods require a new model of structured information that should be able to handle inconsistency, ambiguity, uncertainty, etc. [ABK+07].

The DBpedia project derives a data corpus and knowledge from Wikipedia. Most knowledge bases today cover specific domains and are mainly built by specialized researchers. Wikipedia, however, has grown into one of the central places of human knowledge. The DBpedia project uses this gigantic knowledge source to structure information and exposes this extraction to the World Wide Web. There are currently more than 2.6 million entities described by DBpedia; these entities include 198,000 persons, 328,000 places, 101,000 musical works, 34,000 films, and 20,000 companies. There are several benefits tied to DBpedia: it is multi-domain, multilingual, automatically evolves as Wikipedia changes and it is accessible on the World Wide Web [BLK+09].

For each entity, DBpedia defines a globally unique identifier that can be dereferenced according to the Linked Data principles. This allows semantic reasoning about entities that are defined on Wikipedia. Most articles on Wikipedia consist of free text, however they also contain different types of structured information in wiki-markup form. This is imposed by MediaWiki, the software which is used to deploy Wikipedia13. Wiki-markup can be used to construct different kinds of elements such as infobox templates, geo-coordinates, images and links across different language editions of Wikipedia; an example of this mark-up can be seen in Figure 2.11.

DBpedia makes use of this mark-up to extract knowledge. The architecture of DBpedia that is used to extract this knowledge is discussed in detail in [BLK+09, ABK+07]. The main components of DBpedia are:

PageCollections These are an abstract representation of local or remote Wikipedia articles.

Destinations These components allow the extracted knowledge to be stored or serialized.

Extractors Extractors take wiki mark-up and convert it to RDF triples.

Parsers While the extractors are extracting data, the parser components provide support by determining data types, converting values between units, etc.

13https://www.mediawiki.org/wiki/MediaWiki


Figure 2.11: An example of the infobox that is shown on the Wikipedia page and the wiki-markup which is used to generate the box.

ExtractionJobs Extraction jobs group a page collection, extractors and a destination into a workflow.

The Extraction Manager is the core of the DBpedia framework; it handles the process of passing Wikipedia articles to the extractors and sends the output to the destination. It also handles the URI management and resolves redirects between articles. For further detail about the extractors and parsers, we refer to [BLK+09].

In order to keep DBpedia up to date with the evolution of Wikipedia, the framework uses two workflows: a dump-based extraction and a live extraction. In the latter case DBpedia uses live updates from Wikipedia to update its content using granted access to the Wikipedia OAI-PMH live feed. This feed reports all Wikipedia changes. In the former case the SQL dumps of 30 Wikipedia editions, which are published on a monthly basis, are analysed and fed to the extraction manager. In both cases the resulting knowledge is made available as Linked Data, for download and via DBpedia's SPARQL endpoint. The architecture of DBpedia can be seen in Figure 2.12.


Figure 2.12: The DBpedia architecture. Note that not all parsers are present in this figure.


We note that there are many other sources of Linked Data such as Yago, Freebase, etc. However, we chose to discuss DBpedia given that it is the biggest of them all [BHBL09, BLK+09, ABK+07]. One can also search for these semantically rich resources using search engines like Sindice. Sindice is oriented towards providing access to the documents containing instance data [BHBL09, ODC+08]. At this point we have discussed where one can retrieve semantically rich entities. In the next section we will discuss how this data is structured and how we can approach it.

RDF, OWL and SPARQL

The Semantic Web heavily relies on data that is defined using RDF (Resource Description Framework). RDF encodes data using triples of subject, predicate and object. It allows meaning to be expressed rather like a sentence with a subject, a verb and an object. The subject and object can be presented in two ways: either they are both URIs which identify a resource, or they are represented using a URI and a string literal respectively. The predicate defines the relation between the subject and object, also using a URI [BHBL09, BLHL+01].

For example, an RDF triple can state that two people, Bilbo and Gandalf, both identified by a URI, are related by the fact that Bilbo knows Gandalf. Similarly, an RDF triple can express that a given monument can be found in a city. Two resources linked in this fashion do not necessarily need to reside in the same data set. This way of linking thus allows two data types from heterogeneous datasets to be connected, thereby creating the Semantic Web [BHBL09].
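The Bilbo/Gandalf example can be written down as concrete triples; the following sketch uses the rdflib Python library (an assumption, not part of this dissertation's toolchain) together with a made-up http://example.org/ namespace and the FOAF vocabulary mentioned in the next paragraph.

from rdflib import Graph, Namespace
from rdflib.namespace import FOAF, RDF

EX = Namespace("http://example.org/")
g = Graph()
# subject, predicate and object are all URIs here
g.add((EX.Bilbo, RDF.type, FOAF.Person))
g.add((EX.Gandalf, RDF.type, FOAF.Person))
g.add((EX.Bilbo, FOAF.knows, EX.Gandalf))
print(g.serialize(format="turtle"))  # returns bytes in older rdflib versions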

One can use the RDF Vocabulary Definition Language (RDFS) and the Web Ontology Language (OWL) to describe entities in the world and how they are related. Vocabularies are expressed in RDF and define collections of classes and properties using terms from RDFS and OWL, which provide varying degrees of expressibility. Vocabularies are not restricted or governed by any instance; everyone can publish them to the Web of Data, and they can in turn be connected by RDF triples that link classes and properties from one vocabulary to another. There are many vocabularies available such as FOAF, SIOC, vCard, Dublin Core etc. [BHBL09, MVH+04].

Given that RDF forms a directed, labelled graph, it allowed the definition of the SPARQL query language for RDF. SPARQL allows queries to be expressed across diverse data sources, whether the RDF data is stored locally or viewed via middleware. This allows the creation of SPARQL endpoints, which are able to query semantic data sets. It also allows required and optional graph patterns to be specified, along with their conjunctions and disjunctions. The outcome of a SPARQL query can either be result sets or RDF graphs [PS+08]. Moreover, SPARQL allows the creation of systems which are able to answer semantic questions such as “In which book does character X play?”. In [YBE+12] a system called DEANNA is created, which translates naturally asked questions into SPARQL queries, which in turn can be evaluated over knowledge bases such as DBpedia, Yago, Freebase or other Linked Data resources.
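For illustration, the sketch below sends a hand-written SPARQL query to DBpedia's public endpoint using only the Python 3 standard library (the dissertation's service runs Python 2.7, where urllib2 would be used instead); the queried resource "Gandalf" and the printed field are illustrative choices.

import json
import urllib.parse
import urllib.request

query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?abstract WHERE {
  <http://dbpedia.org/resource/Gandalf> dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}
"""
params = urllib.parse.urlencode({
    "query": query,
    "format": "application/sparql-results+json",
})
with urllib.request.urlopen("http://dbpedia.org/sparql?" + params) as response:
    results = json.load(response)
for binding in results["results"]["bindings"]:
    print(binding["abstract"]["value"][:200])  # first part of the English abstract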

2.2.4 Available tools

In this section we will briefly discuss which tools are available to support the techniques mentioned above. Many of these tools can achieve the same goal. For example, TextCat14 implements the ideas that are presented by [CT+94] in order to identify languages. However, there are also tools, which we will discuss shortly, which do that and more. As mentioned in the previous section, these algorithms need a certain amount of data in order to work correctly. TextCat is able to identify 76 languages15. In order to use TextCat, the program must be downloaded. It can either be run as a binary via the command line or by using a web interface. TextCat is distributed under GNU GPL v2.1, which means it is free software and the source code is available and can be adjusted. Another tool, which is commercially available, is AlchemyAPI. There are free plans that allow light usage of the service. AlchemyAPI offers a wide range of text analysis functions, which we will shortly discuss. Once a document is fed to AlchemyAPI, it will try to do the following [Tur13]:

Language identification AlchemyAPI is able to recognize over 100 languages.

Keyword extraction Alchemy tries to retrieve the words from the document that contribute to the topic of the text.

Concept extraction Concept extraction is very much like keyword extraction, although concepts do not need to be mentioned explicitly. For example, if a given text speaks about two wizards, Harry and Voldemort, the keyword extraction will identify them, but the concept extraction will identify the concept magic instead.

14http://odur.let.rug.nl/~vannoord/TextCat/
15http://odur.let.rug.nl/~vannoord/TextCat/list.html


Entity extraction In this module AlchemyAPI tries to look explicitly for entities that are defined on Linked Data sets such as DBpedia, YAGO or Freebase. This is a very important feature and a point of interest for this dissertation.

Another tool that offers the same variety of functions is OpenCalais. OpenCalais, however, has some shortcomings as it can only process three languages: English, French and Spanish. For the supported languages OpenCalais will try to identify entities, events and facts [But09]. For example, consider the following (fictional) excerpt: “Gandalf was appointed as head of the fellowship of the ring. He would first visit Saruman in Isengard, Middle-earth to seek advice.” Within this excerpt OpenCalais should be able to deduce entities (in this example OpenCalais was able to identify only Gandalf), events (OpenCalais was able to correctly deduce that Gandalf was appointed head of the fellowship) and facts.

An example of these three technologies can be found in Figure 2.13. We used the excerpt mentioned above in all three cases.


(a) TextCat HTML view which returns the identified language of a pasted text. The language of the excerpt was mistakenly identified as Scots.

(b) The AlchemyAPI service, showing the concept extraction tool. AlchemyAPI shows very promising results.

(c) The OpenCalais service.

Figure 2.13: Set of tools which can be used for semantic reasoning.


In this section we have described some key concepts of NLP and the Semantic Web. We have shown the structure of the biggest Linked Data provider and how we can approach it. We have also discussed some of the available tools. As we have mentioned before, the tools are scarce for the general public and are oriented more towards developers and researchers. Among the tools that we have presented here, we are most impressed by AlchemyAPI. It was able to accurately extract the correct information from a given text. Its language support is also very broad. Furthermore, it has also been able to link well with DBpedia and other Linked Data providers. On the subject of the Semantic Web we may conclude that, although the tools to query the Semantic Web are quite scarce for the public, the Semantic Web is the future of the current Web as we know it. It not only allows content to be consumed by humans, but it also allows machines to reason with the available data, which enables more feature-rich applications. In the next section we will take a look at how the current literature manages the real estate problem.


2.3 Real Estate

There has been a lot of (on-going) research on how the screen real estate of a mobile device can be managed. The main issue with mobile devices as well as wearables is the amount of information that has to be displayed given the little space that is available. The way an application handles this problem affects the user's perception and the way he is able to handle complex tasks on small-display devices [CK04]. Furthermore, according to [Chi06] there are more peculiarities that cause issues on mobile devices. They are listed below:

• The used aspect ratio is very different from the usual 4:3 width/height ratio.

• The hardware within the device is still not on par with its desktop counterparts.

• The input sensors and/or devices are not always suited for complex tasks. They are also very diverse (ranging from one-hand thumb-based input to point-and-tap stylus) and very different from the traditional mouse and keyboard.

• The connectivity of a mobile device (and certainly of wearables) is a lot more variable, which can cause issues for applications that heavily rely on data exchange.

• Mobile applications are used in a very variable environment as well. For example, an application has to perform in lighting conditions ranging from total darkness to extreme sunlight.

• Using mobile devices while multitasking (e.g. walking) also implies that the user has less attention for the application, making it a secondary task.

Some of these issues are unlikely to disappear in the future, given that mobile devices have to stay compact. However, there are different techniques to manage some of these problems, ranging from interface manipulation to gesture-based systems. In this section we will shortly discuss existing techniques that should enhance the user experience on small displays.


2.3.1 Design principles

One possible way of mediating the problems with small screens is of course building a completely new application using principles that maximize screen usage. In this subsection we will give an overview of these principles. The authors of [Chi06] define a set of questions which should allow a designer to make the right decisions when designing a mobile application. We will give an overview of these questions here.

How is information visually mapped? When designing an application, data (such as strings) is transformed into graphics characterized by visual features. It is important that a mapping is established in order to make conceptually important aspects equally perceptively important.

Which data is to be selected to be shown to the user? It is important to select the right information, but it is equally important to select the right amount of data to be shown. Burdening the user with unnecessary data will result in a more complex system to reason with.

How is the selected data presented on the available display space? Even with a clear plan to visualize information, one will observe that the display is too small to show everything. This presentation problem is one of the main issues of this dissertation.

Which tools allow the user to interactively explore and rearrange the visualisation? Providing interactive tools to manipulate the visualisation increases the engagement of the user.

Are human capabilities (both perceptual and cognitive) taken into account? Users should be able to browse quickly through visualisations and gather knowledge from them.

Is the visualisation tested on users? Visualisations should always be thoroughly tested on users and feedback should be used to adjust the design.

As stated before, the presentation problem is the issue we need to investigate further. Let us consider a real-life example as shown in Figure 2.14. In this example a user is interested in a set of POIs (Points Of Interest) on a map, which are shown in Figure 2.14a; after he has set his preference the POIs are pruned as shown in Figure 2.14b. If the user were to drive into the city towards one of the POIs and zoom into one section of the map, as can be seen in Figure 2.14c, he would lose the detail of the whole map, making it more difficult to navigate.


(a) A list of all the POIs

(b) A list of POIs that match the user's preference.

(c) A detailed view of one of the selected POIs.

Figure 2.14: The presentation problem as described by [Chi06]. Figures taken from the same paper.

The solution to the presentation problem can be categorized into one of the following four classes:

Overview+Detail This approach provides two separate views simultaneously: one to give the context (in a smaller view) and one for the detail (in the bigger view). This can be seen in Figure 2.15a.

Focus+Context This is similar to the previous approach, but without separating the views from each other. Fisheye is an example of this approach, which is discussed in the next section.

Visual references to sections outside the detail view This can be seen in Figure 2.15b. With this approach, the user is made aware of other components that are not shown in the detail view using a visualisation (in the case of the figure, a halo-like figure).

Intuitive ways of switching between parts of the visualisation This approach uses classic approaches (described in the next section) such as zooming and panning, but tries to reduce the cognitive complexity of a task.


(a) A solution presenting the detail view as well as the overview in a smaller window.

(b) Another solution, which uses visual hints for elements outside the scope of the detail view.

Figure 2.15: Some of the solutions to the presentation problem as described by [Chi06]. Figures taken from the same paper.

Furthermore, [Chi06] describes five common objects (text, pictures, maps, physical objects and abstract data) and design guidelines for them. We will discuss two of them (i.e. text and pictures) that are important for this dissertation below.

Text Small screens allow only a small quantity of text to be shown. Dynamic text presentation can be a solution, e.g. text can be scrolled horizontally, or small chunks can appear in a sequence at a single location.

Pictures An approach similar to dynamic text can be used here, where chunks of a picture are shown at a single location. This technique can work well for pictures of people, but tends to be ineffective on general-purpose pictures.

Another way of modelling is presented in [EVP00]. It uses a declarative model defined in the MIMIC language. This model has three major components: the platform model, the presentation model and the task model. The platform model describes the different kinds of devices that can be used to run an application. It includes information about constraints imposed by the system, such as screen size. The visual appearance of the UI, on the other hand, is defined by the presentation model. This model can include information about the hierarchy of windows and widgets. Lastly, the task model represents the different tasks that the user may want to perform; additional data about subtasks or goals can be found in this model as well. These models are built using two kinds of objects: Abstract Interaction Objects (AIOs) and Concrete Interaction Objects (CIOs). The difference between these objects is that Abstract Interaction Objects cannot be executed. An AIO is also a platform and implementation agnostic object, making it highly portable. Concrete Interaction Objects are executable interactors on a specific platform. At the same time, these objects are tied together, as AIOs are usually the parent of several CIOs. In order to adapt to platform constraints such as screen resolution, the objects described above need to be adjusted. One way is to shrink the interactors, while still fulfilling usability constraints related to the AIO type, i.e. the length of an edit box can be reduced to a minimum. Another approach might be to replace the interactor with a smaller alternative, e.g. a boolean checkbox takes less space than a pair of radio buttons. A presentation model consists of multiple AIOs (which become composite AIOs) and the task model. Together they can define a presentation model for each platform. AIOs can be removed, moved or added to a presentation model; in order to know which is the correct AIO in a presentation model, decision trees can be used. They are easy to generate and easily understood by people, as demonstrated in Figure 2.16.


Figure 2.16: An example of a decision tree that allows an appropriate AIO to be chosen. Figure taken from [EVP00].

In [CH08] another set of design guidelines is given, based on empirical tests performed by participants, which we will summarize below:

Design for landscape presentation Mobile devices are typically held in portrait position. Designing in landscape mode offers more flexibility and was preferred by the participants.

Minimize scrolling The participants all favoured no scrolling or at least a minimum of scrolling.

Design for short and task-focused interactions It is important to realize that when using a mobile device, one is implicitly dividing his attention between multiple tasks. For the best results it is suggested to design applications in such a way that actions are temporally short and that goals can be achieved with a minimum number of interactions.

One-step interaction The user should receive immediate feedback from interactive elements (such as buttons, sliders etc.), e.g. a click on a “view” button should instantaneously update the whole view with the desired result. This can be achieved by using minimal visual effects such as changing font colours or by playing sounds.

Another possibility for handling the issues with small screen space is manipulating the interface that is shown to the user. We will discuss this in the next subsection.


2.3.2 Interface manipulation

With the advent of the WWW on mobile devices, different techniques have been developed to show large documents on small screens. Among these are fisheye, zooming and panning. Panning is the easiest system, in which the user can scroll and pan through the global overview. The fisheye technique shows a global overview and allows the user to zoom in on a particular section of the global document [Fur86]. Zooming, on the other hand, allows a particular section to be zoomed in on but obscures the global overview, i.e. only that section is visible to the user. An overview of these methods is presented in Figure 2.17. The authors of [GF04] made a comparative study of the three methods (panning, zooming and fisheye). The study did not use any hand-held devices; instead the setup contained two displays. One of them was a large display (with a resolution of 1600×1200) and the second was a smaller one (with a resolution of 1024×768). The large display showed the source interface while the smaller one was used to show a small-display version of the interface. The subjects had to follow a specific set of instructions for three different kinds of tasks: they had to edit a document, navigate the web and monitor a control panel. The tasks were randomly assigned to the subjects. The results concluded that fisheye views and zooming interfaces are effective approaches for navigating, whereas panning techniques perform poorly and are not liked by users: panning was slower because it took time to find the correct location, and at the same time users lost events that were happening outside the viewport. Moreover, the study states that an overview of the entire interface is important for navigation.


(a) The global document that is being viewed.

(b) Using a fisheye, the global overview is kept, and a particular section is put under a magnifying glass (fisheye).

(c) When zooming, this particular section is visible, but the user is not able to see the global information.

Figure 2.17: Interface manipulation methods: zooming and fisheye.

However, when a user pans and zooms in large documents, a small movement of the scroll handle might cause sudden jumps to a distant location that could disorient and frustrate the user. Furthermore, scrolling should be avoided according to [CH08]. One way to remedy this without losing the global context is by using Speed-Dependent Automatic Zooming (SDAZ). With this approach the user is still in control of the scrolling speed, but the system will automatically adjust the zoom level so that the speed of visual flow across the display remains constant. This means that if the user scrolls fast, the system zooms out, and it zooms back in when the user scrolls more slowly [EMS04]. Furthermore, the authors of [EMS04] explain that by using the input sensors of a mobile device, such as the accelerometer, one can save space on the mobile display (there is no need to show scrollbars) and at the same time provide a smooth user experience. Using the tilting of the device as input and state-space models, an application was made that implemented SDAZ and could navigate through documents. Users were given a version of the application with and without SDAZ. Users of the application with SDAZ expressed their satisfaction with the zooming level and the automatic mechanism, while users of the application without SDAZ stated that they would appreciate extra controls to zoom in or out on the display.
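The core of SDAZ can be captured in a few lines: keep the perceived visual flow (scroll speed times zoom factor) roughly constant. The sketch below is a simplification under assumed units and constants, not the state-space model used in [EMS04].

def sdaz_zoom(scroll_speed, target_flow=800.0, min_zoom=0.1, max_zoom=1.0):
    # visual_flow ~= scroll_speed * zoom, so zoom ~= target_flow / scroll_speed
    # scroll_speed is in document pixels per second; the constants are illustrative
    if scroll_speed <= 0:
        return max_zoom
    zoom = target_flow / scroll_speed
    return max(min_zoom, min(max_zoom, zoom))

# fast scrolling zooms out, slow scrolling zooms back in
print(sdaz_zoom(200.0), sdaz_zoom(4000.0))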

Another way to approach zooming is explained in [BXWM04]. The authors propose a new technique called collapse-to-zoom. With this technique a thumbnail view is shown whenever the content of a web page is too big for the display. After this step the user is able to use gestures, which we will discuss later on, to collapse columns or sections he deems irrelevant. When the user performs the gestures to collapse a section, the other sections are re-rendered to provide greater detail; sections can also be re-expanded in case the user wants more detail. The authors claim that this reduces the need for zooming and that the user will be able to explore content more precisely and efficiently. This is achieved by allowing the user to continuously progress towards the desired content instead of iterating between overview and detailed view. This technique also reduces the need for scrolling due to the re-rendering. As we previously mentioned, the technique uses a set of gestures to collapse or expand sections, which is dubbed the marquee menu by the authors. Marquee menus are based on marquee selections and marking menus. The selection defines a rectangular area enclosing the start and end point of the drag gesture, while the menu provides command selection based on the directional gesture. The marquee menu combines the selection and command execution in one single gesture. The command that is to be executed is determined by the direction of the drag movement. Vertical up movements will expand the selection, while vertical down movements will collapse it. Expanding an area will collapse everything outside the selected area. Diagonal movements towards the top-right and bottom-left will collapse or expand a given column, while the opposite diagonal movements provide the same function, but on a more fine-grained scale. Figure 2.18 gives an overview of the different commands and their corresponding gestures.


Figure 2.18: The marquee menu gesture commands as presented by [BXWM04]. Figure taken from the same paper.

2.3.3 Customized widgets

In the previous sections we described two possible ways to overcome the small screen size. A third option is to use specially designed widgets to maximize screen usage. We will shortly highlight some of the techniques that are described in the literature.

In [FSM98] a widget called the bullseye menu is presented, which is a more detailed version of the Marking Menus described in [KB94]. This menu consists of a series of concentric circles which are divided into sectors. The space between two circles is called a ring and each ring is assigned an index, as illustrated in Figure 2.19. In order to select commands the user has to drag the pointing device in the right direction until it has reached the ring in which the command resides, e.g. to close a file (as in Figure 2.19) the user has to drag his pointing device until it has reached ring three.


Figure 2.19: An example of the bullseye menu as presented in [FSM98]. In this figure the “Copy” command is selected. Figure taken from the same paper.

The bullseye menu comes in two flavours: a visual and a non-visual one. In the visual version, the menu is displayed as a response to a UI event caused by the user (such as touching the screen). Once the visual representation is rendered, the user can locate the target, move the cursor to it and perform the selection (by dragging as explained above). The non-visual version does not display a menu but instead sends out a non-visual cue signal (e.g. sounds or vibrations). The user can then directly navigate to the correct ring, and will receive a non-visual signal every time he crosses a ring. For instance, if the user knows that “Close” is in the third ring, he can touch the screen and drag the mouse in the correct direction until he receives three signals, and then select the command. Using empirical tests on twelve subjects, the author found that the mean time for selection using the non-visual bullseye menu follows a linear function of the ring number rather than Fitts' law16. This finding indicates that the user's performance is similar when enhanced non-visual feedback is used compared to visual feedback. Note that the experiments also showed that basic non-visual feedback was not on par with visual feedback, hence the need for enhanced non-visual feedback, e.g. communicating the dragged distance progressively by means of various qualities of sound such as frequency and timbre.
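For reference, Fitts' law in its common Shannon formulation predicts the movement time as

T = a + b \log_2\!\left(\frac{D}{W} + 1\right)

where D is the distance from the starting point to the target, W is the width of the target, and a and b are empirically determined constants; a selection time that grows linearly with the ring index therefore deviates from this prediction.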

Magic Lens is another approach, described in [BSP+93]. Magic Lens is a filtering tool that, once it is put over a surface, will transform the presentation of the objects beneath it. It is, for example, possible to put a lens on a page of a book, after which the words will represent actions that can be performed by the user.

16Fitts' law predicts that the time the user needs to move to a target area is a function of the ratio between the distance from the starting point to the target and the width of the target.

Another customized widget is described in [KEH+96, HIVB95]. Given that these two methods are very similar, we will only discuss [KEH+96]. The authors suggest the use of semi-transparent widgets. The research argues that the use of widgets with a variable transparency can help to maximize usable screen space, i.e. by adjusting the transparency of a widget while at the same time upholding the legibility of the content beneath it. When implemented, it should be possible to display the content and widget simultaneously without sacrificing screen space for the widget. There is a caveat however, as it might not be clear to the user which of the two layers (content layer and widget layer) he is interacting with. This is illustrated in Figure 2.20.

(a) Using opaque widgets. At any given point in time, it is clear to the user with which layer he is interacting.

(b) The semi-transparent widgets make it much more difficult to distinguish which layer one is working with.

Figure 2.20: Caveat when using semi-transparent widgets as opposed to opaque widgets. Figure taken from [KEH+96].

One possible way to resolve this problem is to disable the ability to select the underlying objects. However, this might confuse and frustrate the user, who is expecting to be able to manipulate visible objects. The authors propose to detect the layer receiving user actions based on variations in the duration of interaction, i.e. the length of time the user is engaged with a region of the physical screen determines which virtual layer will receive the events to process. Based on empirical studies, the authors set the transparency of the text to 20% and the transparency of the widgets to 80%. Furthermore, test subjects were to perform some basic actions on a web page, such as selecting hyperlinks that are overlapped by widgets. The subjects had to perform the actions six times. Three times they did so using a prototype that would favour hyperlinks when they overlapped with widgets and three times with a prototype favouring the widgets instead. Furthermore, the test used response delays: if the user held the selection for longer than a predefined amount of time, the selection would be passed to the underlying layer. The response delays were enforced during the tests. This means that if the user was in the link-first (links are favoured) group, he had to hold the selection for the predefined time in order to be able to select a widget, even if the widget did not overlap with any text. The results showed that the subjects preferred the link-first method, although the error rates for both methods were very similar. The subjects also expressed that they would rather not have the response delay if the widgets and content did not overlap.

2.3.4 Sonically enhanced systems

There has been a lot of research on how screen space can be saved and how a UI can be navigated using gestures and sounds [BC99, BM00, B+99, Bre02, BLB+03]. We will summarize the main concepts of these approaches in this subsection.

In [BC99, B+99, Bre02] an application was built in which users had to enter a set of numbers to match a given target number. They had to do so using on-screen buttons which came in different sizes. The application can be seen in Figure 2.21. The buttons came in two different sizes, a large variant of 16×16 pixels and a smaller one of 4×4 pixels. Each of these sizes had three different versions: one without sound effects, one using basic sounds and one using enhanced sounds. The basic sound consisted of a keyclick when the screen surface was released after a button was successfully selected. The enhanced sound would use the basic sound but would also add sounds when the user hit a button and when he was slipping off a button. These studies found that enhanced sounds were in the worst case as efficient as the basic sounds. They allowed buttons to be reduced in size while keeping the usability high.


Figure 2.21: On the left side one can observe the smaller (4×4 pixels) buttons, while on the right the bigger (16×16 pixels) buttons are used. Figure taken from [BC99].

Another approach is tested in [BM00], where the authors made use of SoundGraphs. SoundGraphs are an alternative way to present line graphs to blind individuals. This method uses sound pitches to present information. The stream of sound pitches is mapped to the Y-axis, while time is mapped to the X-axis. A test panel had to perform two tasks. In the first one, they had to look for information in a large set of data. While gathering this information, they would earn stock, which they could sell; the best moment to sell was shown in a graph. The second task was to monitor this graph in order to decide when to sell the stock. These tasks had to be done simultaneously. The study found that SoundGraphs allowed the subjects to concentrate on the data-gathering task while at the same time observing changes in the line graph and ultimately deciding when to sell the gained stocks. This suggests that SoundGraphs would be a perfect fit to show dynamic information without taking up screen space.

Sounds, in combination with gestures, can also be used to navigate through menu items without having to use screen space or requiring the user to divide attention between the navigation and other activities. In [BLB+03, LB03] the authors use 3D sounds and head gestures to navigate through the menu items. 3D sounds allow a sound source to appear as if it is being played from anywhere in space around the individual. The system placed a (virtual) bullseye (as described in [FSM98]) around the user with the user's head as the heart of the bullseye. This is demonstrated in Figure 2.22. The users could select a menu item by nodding towards the direction from which the sound was coming.

Figure 2.22: The (virtual) bullseye as presented in [BLB+03]; every section represents a menu item that can be selected. Figure taken from the same paper.

In order to have the nodding registered correctly, the authors built a simple recognizer with empirically defined values. In order to be registered as a nod, the head had to move more than 7° within a timespan of 600 ms. Otherwise it would either time out (the gesture took more than 600 ms) or it would be considered a normal head movement. The authors used two kinds of sounds: egocentric and exocentric. Using egocentric sounds, the sounds are placed at every 90° angle from the user's nose and the sounds remain fixed with respect to the head when the user turns towards a certain sound. Exocentric sounds, on the other hand, are arranged in a line in front of the user's head and the user can select them by slightly rotating the head until it is facing the desired sound, followed by a nod. Users had to use a wearable device and walk through an obstacle course while selecting menu items simultaneously. The results indicated that egocentric sounds enable the user to accurately select options while walking. This suggests that head gestures combined with sound can enable users to navigate a mobile (wearable) device without having to look at the screen at all.


3 Setup

In this chapter we will explain the setup of our experiments and the implementation of the solutions we have devised. For this dissertation we have used the following hardware:

Element   | Specification
Brand     | Apple
Model     | MacBook Pro (Retina, 15-inch, Mid 2014)
RAM       | 16 GB 1600 MHz DDR3
Processor | 2.2 GHz Intel Core i7

Table 3.1: Hardware used for tests and development

On the above mentioned device a virtual machine was set up running Ubuntu 15.04. We have chosen Ubuntu due to the general availability of the software we needed throughout this dissertation. This virtual machine runs a web server that serves a Python application. The Linux distribution was provisioned with the software described in Table 3.2.


Software/Module | Version | Used for/as
Apache 2 | 2.4.10 | Web server, serving the REST application (see further)
Tesseract | 3.03 | One of the chosen OCR packages
OCRopy (formerly known as OCROpus) | 0.2 | Second chosen OCR package
Python | 2.7 | Necessary for Flask
Flask | 0.10.1 | Micro-framework that allows Python REST apps to be served more easily1
BeautifulSoup | 4.3.1 | Used to convert OCR results to plain text
Alchemy API | N/A | Used to contact the Alchemy API NLP service
DiffMatchPatch | N/A | Used for testing the OCR software (see further)
httplib2 | 0.9 | Python HTTP library used to contact the OpenCalais NLP service
mod_wsgi | 4.3.0 | Used to expose the Flask application
ImageMagick | 6.8.9 | Manipulate client images
Android SDK | 22 | Used to build the client software
Ion | 2.1.6 | Android library to ease the downloading of resources

Table 3.2: Software deployed and used on the Ubuntu virtual machine.

As mentioned in Table 3.2, Flask is used to create a REST application which responds to client calls. In order to expose the Flask application via Apache 2 to the outside world, it is necessary to set up WSGI2 as well as to configure Apache 2. A preliminary guide on how to set up Flask using Apache is provided in Appendix A. In the following sections we will describe the architecture of our REST webservice, the software for the OCR, the NLP services and finally the client software.
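The essential piece that mod_wsgi needs is a small WSGI entry point exposing the Flask application object; the sketch below is a minimal example under assumed module and path names (the actual configuration is the one described in Appendix A).

# app.wsgi -- hypothetical path /var/www/webservice and module name "webservice"
import sys
sys.path.insert(0, "/var/www/webservice")

from webservice import app as application  # mod_wsgi looks for the name "application"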

3.1 REST-webservice

For the web service we have opted for a client-server architecture (henceforth referred to as CSA). This setup is illustrated in Figure 3.1a. This structure allows the client to stay lightweight and at the same time enables complex operations such as OCR and shell commands. The use of a CSA has the added benefit that it enables cross-platform development. Indeed, by providing the complex operations as a service, it shifts the focus of an application towards design and functionality rather than implementing difficult functions such as OCR logic.

1“The “micro” in micro-framework means Flask aims to keep the core simple but extensible. By default, Flask does not include a database abstraction layer, form validation or anything else where different libraries already exist that can handle that. Instead, Flask supports extensions to add such functionality to your application as if it was implemented in Flask itself.”

2The Web Server Gateway Interface was first mentioned in PEP 333; it describes an interface between web servers and web applications or frameworks for the Python programming language.


(a) The general structure of the webservice.

(b) The processing of the client's request. According to the client's preference, first the choice is made between OCRopus or Tesseract (a). The output of the OCR is then channelled to the preferred or default NLP service to extract entities (b).

Furthermore, by transferring the complex operations from the client to the server, the client can remain battery friendly. Let us take a closer look at the architecture and its internal workings. A typical use case starts with the user pointing his camera-equipped device at a book. The client application that is running on the device will take a picture and send it to the server for processing.


The web service, which uses REST as its mode of operation, expects an HTML form that contains the image.

Within the same form, the client can also express some preferences with respect to the OCR software and NLP service used for processing (see further). These options are shown in Table 3.3; a sketch of such a client request follows the table.

Name | Possible values | Required
image | any usual image (JPEG, PNG, etc.) | Yes
ocr | 'ocropy' or 'tesseract' | No
calais | 'true' or 'false' | No

Table 3.3: Possible fields of the HTML form sent from the client.
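Such a request is an ordinary multipart form POST; this sketch uses the third-party requests library and a placeholder host, both assumptions (the Android client described later uses Ion for the same purpose).

import requests

with open("page.jpg", "rb") as image:
    response = requests.post(
        "http://example.org/getResources",            # placeholder for the service URL
        files={"image": image},                        # required field
        data={"ocr": "tesseract", "calais": "true"},   # optional preferences
    )
print(response.json())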

Once the image and the form arrive at the web service, they are processed. This processing includes saving the image locally, and setting which OCR software to use and which NLP service to call, according to the preferences passed from the client. This is shown in Listing 3.1.


Listing 3.1: The Python code of the function handling the processing of the HTML form

@app.route("/getResources", methods=["POST"])
def getResources():
    # get the image from the form
    f = request.files['image']
    # save it in the predefined folder
    f.save(join(app.config["UPLOAD_FOLDER"], f.filename))
    # extract the filename and extension
    # set default OCR-software
    command = app.config["APP_PATH"] + "ocropy"
    # change it if the client has set another preferred OCR-software
    if request.form.get('ocr') == 'tesseract':
        command = app.config["APP_PATH"] + "tesseract"
    ...
    # Call the appropriate NLP-service,
    # according to the client's preference
    if request.form.get('calais') == 'true':
        output = call_calais(html_doc)
    else:
        output = call_alchemy(html_doc)

If the client only sends an image (without preferences) the default options for OCR and NLP are taken (in our case OCRopy and AlchemyAPI, see further). In the second step the web service will pass the image to the OCR software; as explained before, we have chosen OCRopy and Tesseract as the possible choices. For this purpose we have created two shell scripts, written in Bash (Bourne-again shell). Both scripts are quite similar; we will explain one of them in detail in Section 3.2.

At this point the web service has access to the OCR result of the image, which will be sent to the chosen NLP service. Again, the client can choose between two options: Alchemy API and OpenCalais. The former can be contacted using the provided Python API3 while the latter is contacted using a self-written Python script; an in-depth explanation is given in Section 3.3. Both services return JSON-formatted responses which can be interpreted in the last step. These steps are illustrated in Figure 3.1b. In the last step, media is selected from different channels and is then sent to the device.

3http://alchemyapi.com/developers/sdks


3.2 OCR-software

In this section we will describe the bash scripts developed to perform OCR on a given image. We have developed two scripts for this purpose: one for OCRopy and one for Tesseract. Although these bash scripts are similar and could easily be merged into one script, we have kept them separate for clarity's sake. Both scripts are shown in full in Appendix B. We will discuss the OCRopy script in more detail in this section. In the first steps, the script sets some basic variables that can be passed to the software in order to be flexible with respect to the location, file format, file name etc. The options that can be passed to the scripts are described in Table 3.4.

Option | Meaning
w | Set the working directory, i.e. where the client image resides.
o | Set the output file name.
f | Set the output folder, i.e. where the OCR result is saved to.
i | Specify the image name to be processed by the OCR software, without the extension.
e | The extension of the image.
t | (only OCRopy) Set the temporary folder. OCRopy requires a temporary folder in order to process images.

Table 3.4: The possible options that can be passed to the shell scripts

As seen in the previous chapter, OCR software requires a minimum DPI in order to process images. In our case, a minimum resolution of 600 by 600 pixels has been set (this is dictated by OCRopy). Hence, in the second step the script calculates the resolution of the image that is to be processed and, as long as the resolution of the image is below the given threshold, the image is enlarged (by 50% per iteration). This code can be seen in Listing 3.2.

Listing 3.2: Functions that calculate the resolution of the given image and enlarge it iteratively if it is below the threshold.

# Calculate the size of the provided image
function calculate_height_width()
{
    height_width=`convert $imageName$extension -format "%h$%w" info:`
    if [[ "$1" = "height" ]]; then
        height=`echo $height_width | sed 's/\([0-9]*\)\$\([0-9]*\)/\1/'`
    else
        width=`echo $height_width | sed 's/\([0-9]*\)\$\([0-9]*\)/\2/'`
    fi
}

calculate_height_width "height"
calculate_height_width "width"

# if the calculated height or width is below the threshold,
# enlarge the image using ImageMagick
while [[ $height -lt 600 || $width -lt 600 ]]; do
    convert $imageName$extension -resize 150% $imageName$extension
    calculate_height_width "height"
    calculate_height_width "width"
done

Finally, the actual OCR steps are executed. While Tesseract just takes the image and processes it internally, OCRopy requires the aforementioned OCR steps to be executed manually, as shown in Listing 3.3.

Listing 3.3: The OCRopy pipeline: binarisation, line segmentation, character recognition and hOCR output.

1  ocropus-nlbin -n $imageName$extension -o $tempFolder
2  if [ $? -eq 0 ]; then
3      ocropus-gpageseg -n --minscale 5 "$tempFolder/????.bin.png"
4  else
5      exit 2
6  fi
7
8  if [ $? -eq 0 ]; then
9      ocropus-rpred -n "$tempFolder/????/??????.bin.png"
10 else
11     exit 3
12 fi
13
14 if [ $? -eq 0 ]; then
15     ocropus-hocr "$tempFolder/????.bin.png" \
16         -o $outputFolder$outputFile
17     echo "All good"
18 else
19     exit 4
20 fi


The first step, in line 1 of Listing 3.3, is binarisation; the image is converted from grayscale to black and white. In this step, OCROpus also estimates the skew and tries to get the text in the image as horizontal as possible.

The next step is to extract individual lines of text from the image, which is done in line 3. As described in the previous chapter, OCROpus does this by estimating the scale of the text. This is done by finding connected components in the image that was generated in the previous step.

In line 9 the software performs character recognition using neural networks. Lastly, in lines 15–16, all the recognized lines are saved into one file with the hOCR extension and structure.

3.3 NLP-software

This section provides a short walk-through of the NLP services used. Our webservice can work with AlchemyAPI and OpenCalais. Both are REST services which can be called using regular HTTP. Both services are proprietary and require an API key in order to operate.

3.3.1 Alchemy API

Alchemy API provides SDKs for several general purpose languages such as PHP, C#, Java etc. For our service we used the available Python SDK. The SDK is just a wrapper around their existing web API. In order to set up this SDK, it is necessary to clone the company's git repository, navigate into the folder where the code was cloned and then execute the following line:

python alchemyapi.py YOUR_API_KEY

This sets up the SDK and makes it ready to use. Our service only uses one call of this SDK, namely the entities call. This call, given a piece of text, extracts all the entities it can find in the text. It takes the JSON object returned by the API and converts it to a Python object before returning it. The API key can be requested on their website4 and it allows 1,000 daily transactions.
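For reference, a call to the SDK's entities function looks roughly like the sketch below; the exact response field names follow the public AlchemyAPI JSON format and should be treated as assumptions rather than a verbatim excerpt of our service.

from alchemyapi import AlchemyAPI

alchemyapi = AlchemyAPI()  # reads the API key stored by the setup step above
response = alchemyapi.entities('text', "Gandalf was appointed as head of the fellowship of the ring.")
if response.get('status') == 'OK':
    for entity in response.get('entities', []):
        # assumed fields: surface text, entity type and relevance score
        print(entity.get('text'), entity.get('type'), entity.get('relevance'))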

4 http://www.alchemyapi.com/api/register.html
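To illustrate how little code this involves, the following minimal sketch shows how the entities call might be used through the Python SDK; the module and method names follow the SDK's documented usage, and the sample text is of course hypothetical.

from alchemyapi import AlchemyAPI  # module provided by the cloned SDK repository

alchemyapi = AlchemyAPI()  # picks up the key stored by 'python alchemyapi.py YOUR_API_KEY'

text = "Frodo and Sam left the Shire and travelled towards Mordor."
response = alchemyapi.entities('text', text)  # 'text' flavor: plain text input

if response['status'] == 'OK':
    for entity in response['entities']:
        print(entity['type'] + ': ' + entity['text'])
else:
    print('Entity extraction failed: ' + response['statusInfo'])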


3.3.2 OpenCalais

OpenCalais has recently been acquired by Thomson Reuters. One can obtain an API key from their new website5. This allows the developer to upload 5,000 text documents to the API for analysis. The API calls need to be made to a given endpoint using an HTTP POST request. The request has to have at least the following fields in the header:

1. Content-Type: Indicates the input MIME type.

2. x-ag-access-token: The API key.

Additionally, the following options can be specified:

1. outputFormat: Defines the output format.

2. x-calais-language: Indicates the language of the input text.

Unlike Alchemy API, there is no SDK available for OpenCalais. For this purpose, we have created a simple Python script that uses the httplib module for Python. This module allows us to set the proper headers and body and send the request to the OpenCalais servers.
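A minimal sketch of such a request with httplib is shown below; the endpoint host and path as well as the header values are indicative placeholders rather than a verbatim copy of our script.

import httplib

CALAIS_HOST = 'api.thomsonreuters.com'   # placeholder endpoint
CALAIS_PATH = '/permid/calais'
API_KEY = 'YOUR_API_KEY'

def call_opencalais(text):
    headers = {
        'Content-Type': 'text/raw',           # input mime type
        'x-ag-access-token': API_KEY,         # the API key
        'outputFormat': 'application/json',   # ask for a JSON response
        'x-calais-language': 'English'        # language of the input text
    }
    connection = httplib.HTTPSConnection(CALAIS_HOST)
    connection.request('POST', CALAIS_PATH, text, headers)
    response = connection.getresponse()
    body = response.read()
    connection.close()
    return body

print(call_opencalais('Frodo travelled from the Shire to Mordor.'))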

3.4 Client software

All wearable devices that were discussed in the previous sections and chapters use the Android OS. We hence chose to develop the client for the Android OS. The application targets a minimum API level of 22 (Android Lollipop). We will give a detailed overview of the architecture of the application in the following subsection.

3.4.1 Client architecture

The application we have built consists of several Activity classes, which form the front-end of the application. We also use POJOs which help represent the entities and other utilities. The class diagram is displayed in Figure 3.2.

5 https://iameui-eagan-prod.thomsonreuters.com/iamui/UI/createUser?app_id=Bold&realm=Bold


Figure 3.2: The class diagram of the most important aspects of the client software.


The application starts with the MainActivity. This class shows the view of the camera to the user, which allows the user to continue his activity without being interrupted. This class is also responsible for periodically taking pictures and sending them to the webservice for processing. This period is empirically set to 150000 milliseconds (2.5 minutes). Uploading the image and parsing the response from the server is however not a part of its tasks. Once the response from the webservice has been processed, the user is notified (using sound notifications) whether new content is available.
At this point, the user has two options. Using swipe gestures, he can either swipe to the right or to the left to view the available content. When the user swipes to the right, he will see an overview of the detected entities, where he can select one of them (Figure 3.4a). By choosing an entity, the user will be shown a feed (Figure 3.4b) with all the available media (see Subsection 3.4.3). The user can also swipe to the left, in which case he will be presented with all the available images (for all the detected entities) that were fetched from the webservice (Figure 3.4c). The images that are shown (either in the entity feed or the image gallery) can be selected to be viewed in full size. This flow is illustrated in Figure 3.3.
The entity feed, the GIF gallery and the way of notification are our proposed solutions for the real-estate challenge. The feed allows us to pack as much heterogeneous information as possible in one location; the same analogy applies to the GIF gallery. Furthermore, by enabling the application to run in portrait mode as well as landscape mode, the user can choose to show more information on the screen. We have also enabled the user to use different techniques to view densely packed information, e.g. the ability to zoom and pan when viewing the images.


Figure 3.3: The possible flow of the Android client application.


(a) An overview of the detected entities.

(b) A feed for a given entity (Isildur in this case).

(c) Gallery of all the images for all the entities.

Figure 3.4: The different views of the client application.

3.4.2 Fetching and processing webservice response

The Connector class is responsible for sending the data to the webservice and processing its response. This response is, as mentioned earlier, in JSON format. Given that we do not want the application to stop responding while the image is being uploaded and the response is processed, the built-in Android SDK concept AsyncTask is used. The processing takes the JSON response and creates a set of entity objects. In our proof-of-concept application we have only two kinds of entities: characters and locations. These are also the classes that are created when parsing the JSON. Each of these classes has some basic values (such as names, tweets for that entity, etc.). These values are abstracted in a super Entity class.

3.4.3 Possible media

There are a lot of possibilities when it comes to media that can be used to enhance the user experience. We have selected four media channels that are used in our prototype and will describe each of them briefly.


Lineage information

In the use case of our prototype (Lord of the Rings (LOTR)) there are at least 982 characters. This can have the consequence that the user cannot distinguish characters and how they relate to the story. To resolve this we can show lineage information for the character entities that we are able to recognise. We use information that is available on a specialised website called the Lotrproject6. This media channel has detailed lineage information on all the possible LOTR characters. This detailed information can also be used to perform preprocessing on the response from the NLP-services: if an entity is recognised as a character and it cannot be matched with any character from this source, then we disregard the information. An example of this media channel is shown in Figure 3.5.

Figure 3.5: An example of the lineage information on the character Elrond.
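A minimal sketch of this preprocessing step could look as follows; the known_characters set and the entity dictionaries are our own illustration and would in reality be filled from the Lotrproject data and the NLP response respectively.

# Hypothetical filter: drop "character" entities that cannot be matched
# with a character known from the Lotrproject lineage data.
known_characters = set(['Frodo', 'Sam', 'Elrond', 'Gandalf'])  # in reality: loaded from the Lotrproject data

def filter_characters(entities):
    filtered = []
    for entity in entities:
        if entity['type'] == 'character' and entity['name'] not in known_characters:
            continue  # likely a false positive, disregard it
        filtered.append(entity)
    return filtered

print(filter_characters([{'type': 'character', 'name': 'Elrond'},
                         {'type': 'character', 'name': 'Apple'}]))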

Map information

Another highly interactive media channel is the Hobbit project7. This project uses WebGL and HTML5 to create an interactive map where the user can explore the locations of the LOTR world as well as the journeys of characters more in-depth. An example is shown in Figure 3.6.

6 http://lotrproject.com
7 http://middle-earth.thehobbit.com/


(a) An overview of the map with locations.

(b) A specific location on the map. Clicking on the location will allow the user to interactively explore the area.

(c) The journey of the characters is also visualised on the map and can be interactively viewed (Gandalf's journey in this case).

Figure 3.6: Some examples of the map media channel.


Images

The third channel that we use is GIF images. There is a plethora of GIF services available; we chose to use Giphy8 for the simplicity of its API, which allows us to query the service for recognised entities. Besides its simplicity, Giphy also has the advantage of providing a lot of useful information about the GIFs that other services do not offer. A list of (self-explanatory) properties is shown in Table 3.5.

Property

height

width

number of frames

url

size (in bytes)

Table 3.5: Highlights of the image properties provided by Giphy.
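To give an idea of how simple a query is, the sketch below searches Giphy for a recognised entity and reads the properties of Table 3.5 from the 'original' rendition of every result; the field names follow the public v1 search API, and the API key is a placeholder.

import json
import urllib
import urllib2

def search_gifs(entity_name, api_key='YOUR_API_KEY', limit=5):
    # Query the public Giphy search endpoint for GIFs matching the entity
    params = urllib.urlencode({'q': entity_name, 'api_key': api_key, 'limit': limit})
    response = urllib2.urlopen('http://api.giphy.com/v1/gifs/search?' + params)
    results = json.loads(response.read())
    gifs = []
    for gif in results['data']:
        original = gif['images']['original']  # rendition carrying the properties of Table 3.5
        gifs.append({'url': original['url'],
                     'width': original['width'],
                     'height': original['height'],
                     'frames': original['frames'],
                     'size': original['size']})
    return gifs

for gif in search_gifs('Elrond'):
    print(gif['url'])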

Tweets

Lastly, we chose to query Twitter for tweets on the character or location entities we have recognised. Twitter not only shows other users' experiences, but tweets can also have embedded media, as can be seen in Figure 3.7.

8 http://giphy.com and https://github.com/Giphy/GiphyAPI


Figure 3.7: An example of a tweet about the character Elrond.

We want to note that two of these channels are hard-coded and are only usable for this context, whereas the GIF images and tweets are not. It is a challenge to find suitable media without being context-specific. In the next chapter we will describe the experiments we have conducted and their results.


4 Experiments

In the previous chapter we mentioned that we had chosen Alchemy API and OCRopy as the default software of our webservice. In order to make this choice, we conducted several experiments which gave us the incentive to set a particular software package or service as the default. In the following sections we describe these tests, their setup and their results.

Although the focus is on the OCR-software, the performance of the OCR-software also greatly impacts the performance of the NLP-service. Indeed, if the OCR-software performs badly, the likelihood of a good performance of the NLP-service decreases, given that in our practical cases the OCR result is passed to the NLP-service.
For both services we have selected six scenarios and have used two English books, namely the first two of the Lord of the Rings trilogy, i.e. The Fellowship of the Ring and The Two Towers, both written by J.R.R. Tolkien. Each of the scenarios is selected to test a particular (sub)set of functions of the OCR-software and NLP-service.
The first and second scenario consist of, respectively, two pages and a close-up of two pages from The Two Towers. The edition of The Two Towers we had for testing contained a larger font compared to The Fellowship of the Ring, as can be seen in Figure 4.1.


Figure 4.1: The difference in the font size of the two books; The Two Towers (right) and The Fellowship of the Ring (left).

The first scenario uses these two pages and tests whether a bigger font enhances the functionality of the software, in particular the OCR-software. The second scenario uses a close-up of one of these pages to check whether removing noise (such as extra background, etc.) results in a better performance. These tests are aimed at testing the OCR-software. A downscaled version of these pictures can be seen in Figure 4.2 1.

1 These images and all the images that will be discussed in this section can be seen in full resolution on https://github.ugent.be/aghodsi/thesis


(a) The image containing a bigger font. Both pages are present as well as the binding.

(b) A close-up of 4.2a used for experiments.

Figure 4.2: One version of the images used for scenario one and two.

The third scenario tests whether the use of a single page impacts the performance of the OCR-software. Using one page reduces the number of lines and words disappearing in between the binding, and the image also contains less noise. Figure 4.3 shows one of the versions of the one-page image used in our experiments.


Figure 4.3: One version of the one-page image used in our experiments.


In the fourth and fifth scenario we tried to find out what happens if "special" pages are fed to the OCR-software. Both of these scenarios contain two pages. In one of the scenarios one page contains only text, while the second page contains text combined with an image which uses a different font (see Figure 4.4a). The other scenario uses pages containing italic and indented text, as seen in Figure 4.4b.

(a) Image containing both text as well as an image with a different font.

(b) Pages containing italic and indented text.

Figure 4.4: The images used for the fourth and fifth scenario.


The sixth scenario is aimed at testing the NLP-software. This two-page image contains a high number of characters and locations. This scenario should point out which NLP-service has the best performance. An overview of the scenarios is given in Table 4.1.

Scenario name        Use case
big font             Two pages with a bigger font than the rest of the images
big font close up    Same as big font, except only a part of one page is visible
one page             Contains only one page
text image           Contains one page of text and one page of text which contains a different kind of font
indentation          Consists of two pages, one of the pages has italic and indented text
high chars           Two pages containing a high amount of entities (locations, characters, etc.)

Table 4.1: An overview of the scenarios and their use cases.

In the next two sections we will describe the experiments and their results.

4.1 OCR software

Testing OCR-software can be approached in multiple ways. For example, one way is to train the OCR-software for a particular kind of text; the OCR-software will then learn from its mistakes and enhance recognition in the future. We tested how the OCR-software performs out-of-the-box, without any training or modifications. We have chosen this approach because the focus of this dissertation is not OCR optimisation. At the same time, we want a thorough test of the basic abilities of the OCR-software.
We have tested the scenarios as described in the previous section. The pictures were taken in such a way that they simulated realistic reading positions. For each of the scenarios two pictures were taken in different light settings to simulate possible reading scenarios, as can be seen in Table 4.2.
From the pictures taken with our test device (Epson Moverio BT-200) it became clear that the image quality was too bad to be processed by the OCR-software, as can be seen in Figure 4.5.


Light condition (in lux) Simulates

75 Low light reading conditions

800 Reading using reading light

Table 4.2: The different light conditions used in our experiments.

The first problem we had to resolve was the resolution of the images (640 pixels by 480 pixels). We solved this by upscaling the images as described in the previous chapter.

Figure 4.5: Some sample images taken with the Epson Moverio BT-200 wearable.


But even then, the image quality remained too poor to be processed, as can be seen from the results we will present shortly. Because we did not have another wearable device at our disposal, we simulated wearable devices (such as the Google Glass) by using a generic smartphone with a maximum picture quality of 5 megapixels, without using flash. These results were significantly better. The difference between a picture of the same page taken with the glasses (Epson Moverio BT-200) and with the smartphone camera can be seen in Figure 4.6.


Figure 4.6: The same page captured by the Epson Moverio and a generic 5 MP camera in the same light condition.

4.1.1 Results

The tests we discuss here are also defined in our webservice. Browsing to <baseurl>/testOCR, where <baseurl> is the location where the web server is running, will return the results of the tests in an HTML page along with the OCR accuracy.


Clicking one of the links will lead to the character-based diff in a human-readableformat. A call to the webservice is shown in Figure 4.7.

Figure 4.7: A call to the application on the webserver. In the left column all the possible images that were fed to the OCR are shown. The right column shows the outputs from the OCR matched (character based) to the ground-truth text.

For now, the tests are semi-automated. The webservice loops over all the possible light configurations and the available ground-truth files for our scenarios, and for each of these it calls both available OCR-software packages. The results that we obtained from the OCR-software are presented in Tables 4.3 and 4.4. A visualization of the results is presented in Figure 4.8 for the smartphone-camera images and in Figure 4.9 for the BT-200.
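A rough sketch of this loop is shown below; the script names, file locations and the simple character_accuracy helper are illustrative placeholders (our actual implementation uses the character-based diff described later in this section).

import difflib
import subprocess

SCENARIOS = ['big_font', 'big_font_close_up', 'one_page',
             'text_image', 'indentation', 'high_chars']
LIGHT_CONDITIONS = [75, 800]  # in lux
OCR_SCRIPTS = {'ocropy': './ocr_ocropus.sh', 'tesseract': './ocr_tesseract.sh'}  # cf. Appendix B

def character_accuracy(ocr_text, ground_truth):
    # Simplified stand-in for the character-based comparison of Section 4.1.1
    return 100 * difflib.SequenceMatcher(None, ocr_text, ground_truth).ratio()

for scenario in SCENARIOS:
    for lux in LIGHT_CONDITIONS:
        image_name = '%s_%d' % (scenario, lux)  # e.g. one_page_800 (the extension is added by the script)
        truth = open('groundtruth/%s.txt' % scenario).read()  # placeholder location
        for engine, script in OCR_SCRIPTS.items():
            if subprocess.call([script, '-i', image_name]) != 0:
                print('%s failed on %s' % (engine, image_name))
                continue
            # default output location of the scripts (cf. Appendix B); the hOCR
            # markup would be stripped before the comparison in reality
            ocr_text = open('results/result.html').read()
            print('%s / %d lux / %s: %.2f%%' % (scenario, lux, engine,
                                                character_accuracy(ocr_text, truth)))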


Accuracy of the OCR-software (in %)

Scenario            OCRopy 75 lux   OCRopy 800 lux   Tesseract 75 lux   Tesseract 800 lux
big font                84.88           73.98             0.00               0.00
big font close up       75.49           87.35            42.29              40.32
one page                89.24           96.79            52.70              42.01
text image              71.92           75.69            29.34              47.08
indentation             88.69           71.85            15.84               0.02
high chars              84.88           35.84             0.00               0.00
Average                 82.52           73.58            23.36              21.57

Table 4.3: The accuracy results obtained from our webservice for the images of the smartphone camera.

Accuracy of the OCR-software (in %)

Scenario            OCRopy 75 lux   OCRopy 800 lux   Tesseract 75 lux   Tesseract 800 lux
big font                 0.00            0.00             0.00               0.00
big font close up        0.00           33.00             0.00               0.00
one page                 0.29            0.00             0.00               0.00
text image               0.14            0.00             0.00               0.00
indentation              0.00            0.00             0.00               0.00
high chars               0.00            0.00             0.00               0.00
Average                  0.05            5.50             0.00               0.00

Table 4.4: The accuracy results obtained from our webservice for the images of the wearable Epson Moverio BT-200.


The accuracy percentages of both software packages were determined using DiffMatchPatch from Google2. We compared the text output from the OCR with ground-truth files. Our first approach, using Linux' built-in diff software, was not sufficient: the OCR result would often add empty lines, or lines would be recognized entirely correctly but, due to the empty lines or character misalignment, end up at a different place in the paragraph. This would cause diff to mark such lines as incorrect. Instead we used DiffMatchPatch, which does a character-based diff instead of a line-based diff.
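As an illustration, a character-based accuracy can be derived from such a diff as sketched below; the accuracy formula (matching characters over the ground-truth length) and the file locations are our own illustration of the idea, using the diff_match_patch Python port.

from diff_match_patch import diff_match_patch

def ocr_accuracy(ocr_text, ground_truth):
    # Share of ground-truth characters that come back unchanged in the OCR output
    dmp = diff_match_patch()
    diffs = dmp.diff_main(ground_truth, ocr_text)
    dmp.diff_cleanupSemantic(diffs)  # makes the diff human-readable, as used for the HTML report
    equal_chars = sum(len(text) for op, text in diffs if op == dmp.DIFF_EQUAL)
    return 100.0 * equal_chars / len(ground_truth)

ground_truth = open('groundtruth/one_page.txt').read()      # placeholder paths
ocr_text = open('results/one_page_800_ocropy.txt').read()
print('accuracy: %.2f%%' % ocr_accuracy(ocr_text, ground_truth))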

Figure 4.8: The accuracy of the OCR-software for pictures taken with the smartphone camera.

Figure 4.9: The accuracy of the OCR-software for pictures taken with the Moverio BT-200. Note: Tesseract has accuracy 0%.

Figure 4.10: The average performance of the OCR-software depending on the device the pictures were taken with.

Figure 4.11: The average performance of OCRopy vs. Tesseract.

From the plot in Figure 4.10 it is clear that, with an average accuracy of 1.4%, the camera of the Epson Moverio BT-200 is too weak to be used in an environment where one would need to OCR the images taken from the device.

2 https://code.google.com/p/google-diff-match-patch/


The camera images simulating wearables equipped with better imaging sensors, however, seem to be well suited for the job, with an average accuracy of 50.2%. If we look further than the imaging devices, Figure 4.11 shows that OCRopy outperforms Tesseract out-of-the-box: OCRopy manages to achieve an average accuracy of 40.4% versus the mediocre 11.2% of Tesseract. We believe this is due to the internal workings of both software packages, specifically that OCRopy does a better job of correcting the skew of the images.
If we look at the accuracy of both software solutions, it also supports most of our hypotheses about the scenarios. The absence of light can degrade the performance of OCR-software, although this is not the case everywhere; we believe this is an effect of possible skew and extra artefacts in the image rather than of the light settings. It also shows that if we use one page (scenario 3) instead of two pages (and hence remove the binding), the OCR-software can indeed perform better; it even comes near 100% recognition in the case of OCRopy. At the same time the special scenarios, i.e. scenarios 4 and 5, perform less well, especially when there is another font or an image on the same page. Italic text and indentation do not affect the performance notably. Using bigger fonts or close-ups also does not seem to actually enhance the performance of the OCR-software much when using the smartphone camera. It does however seem to enhance the results a tad for the glass images, with an accuracy of 33% versus the usual 0%.
Taking into account the previously described results, we conclude that OCRopy outperforms Tesseract without additional configuration or training. Therefore we have chosen OCRopy as the default OCR-software for our webservice.

4.2 NLP software

We also conducted experiments to find out the performance of the NLP services we use. NLP services, as described in the literature chapter (2.2.4), can offer different kinds of tools and services. OpenCalais, for example, is able to recognize a variety of situations such as business deals and birthdays, while Alchemy API can deduce concepts, actions and taxonomies. One thing that both services offer is entity extraction. This service analyses text and extracts entities such as persons, locations, organisations, etc. Therefore we chose to test the entity extraction abilities of both services. OpenCalais is in addition also able to recognize family relationships (whereas Alchemy cannot); this is also tested in our experiments.


We took the six scenarios we described before and extracted the entities in order to create ground truth files. The files contain one entity per line following the <key, value> structure, where the key is the type of entity and the value is the actual text found in the ground truth text. An excerpt is shown below.

location, Mordor

name, Last Alliance of Elves and Men

person, Gil-galad

location, Beleriand

location, Thangorodrim

person, Frodo

person, Earendil

We did however drop two of the scenarios (big font close up and text image) due to the low number (less than ten) of entities present.

The possible keys that we use are also shown in Table 4.5.

Key

person

name

location

relation

Table 4.5: Possible keys for the ground truth files.

The person and location keywords are self-explanatory. The name key represents an entity that can be seen as both a person and a location. E.g. "Apple" is not only an organisation (and a fruit) but can also be seen as the headquarters of the organisation in Cupertino.
An exception to the <key, value> structure can be found in the high chars scenario, where relationships are also listed. These entries use relation as key, while the value is a triplet in which the relationship is defined. The values use the following structure: person, relation, relative. As an example, consider the following excerpt:

Elwing, daughter of Dior, son of Luthien of Doriath.

In this excerpt we can define the following relation triplets:

Elwing, daughter, Dior

Dior, son, Luthien


The number of entities per scenario is shown in Table 4.6.

Scenario             big font   one page   indentation   high chars
Number of entities   26         15         24            48 (3 relations)

Table 4.6: The number of entities per ground-truth scenario.

4.2.1 Results

For these experiments we tested the services using two cases: one in which the ground-truth text from the selected scenarios is passed to the NLP service in its entirety, and a second in which the text from the OCR is passed to the service. Indeed, in a real-life situation the ground truth text is rarely present. Due to the poor OCR results of the images taken with the Epson Moverio BT-200, we decided to conduct the experiments using only the OCR outputs for the images taken with the smartphone camera. We also opted to use the better lighting condition (800 lux) and dropped the low lighting condition (75 lux). Furthermore, OCRopy was used as OCR-software given its superior performance.

These tests are also defined in our webservice. Browsing to <baseurl>/testNLP will return the results of the tests in an HTML page along with the accuracy. Clicking one of the links will lead to an HTML page containing a table with all the ground-truth entities and whether the NLP service was able to identify each entity. An example of a call to the webservice is shown in Figure 4.12.


Figure 4.12: A call to the NLP-test case web service.

Our tests are again semi-automated. For every ground-truth file we create a dictionary of the entities contained within the file. Subsequently the corresponding (800 lux version) image from the smartphone camera is fed to OCRopus. Afterwards, for each NLP-service two calls are made: one with the OCR output and one with the ground truth file content. Both NLP APIs return a JSON-structured response. From these JSON responses we extracted and separated the location and person entities into sets. We identified these entities using an array of all the possible values used by the services that could be used to identify such entities3. Once the extraction process is over, we loop over the dictionaries containing the entities and for each entity check whether the NLP-service was able to recognize it. For entities with the name type, the entity had to be present in at least one of the extracted sets.
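A simplified sketch of this bookkeeping is given below; the helper names and the dummy person/location sets are our own illustration (in our webservice those sets are filled by parsing the JSON responses of Alchemy API and OpenCalais).

def load_ground_truth(path):
    # Parse a ground-truth file with one "<key, value>" entity per line
    entities = {}
    for line in open(path):
        if not line.strip() or line.startswith('relation'):
            continue  # relation triplets are evaluated separately
        key, value = line.split(',', 1)
        entities[value.strip()] = key.strip()  # e.g. {'Mordor': 'location'}
    return entities

def recognition_rate(ground_truth_entities, persons, locations):
    # Share of ground-truth entities found back in the sets extracted from the NLP response
    recognised = 0
    for value, key in ground_truth_entities.items():
        if key == 'person' and value in persons:
            recognised += 1
        elif key == 'location' and value in locations:
            recognised += 1
        elif key == 'name' and (value in persons or value in locations):
            recognised += 1  # a 'name' may show up as either type
    return 100.0 * recognised / len(ground_truth_entities)

truth = load_ground_truth('groundtruth/high_chars.txt')  # placeholder path
print('%.2f%%' % recognition_rate(truth,
                                  persons=set(['Frodo', 'Gil-galad']),
                                  locations=set(['Mordor'])))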

The results that we obtained are presented in Table 4.7 for the ground truth text and in Table 4.8 for the OCR output text. These results are also visualized in Figure 4.13 and Figure 4.14 for the ground truth text and the OCR text respectively.

3 For Alchemy API, a list of all defined entities can be found on http://go.alchemyapi.com/hs-fs/hub/396516/file-2576982340-xls/AlchemyAPI-EntityTypesandSubtypes.xls. The relevant entities for OpenCalais were extracted from an API documentation PDF document.


NLP service accuracy (in %)

Scenario       big font   one page   indentation   high chars   Average
Alchemy API    12.00      35.71      25.00         44.68        29.34
OpenCalais      8.00      14.29      16.67         21.28        15.06

Table 4.7: The accuracy results for the NLP services for the ground truth texts.

NLP service accuracy (in %)

Scenario       big font   one page   indentation   high chars   Average
Alchemy API    12.00      35.71      29.17         10.64        21.88
OpenCalais      8.00      14.29      12.50          4.26         9.76

Table 4.8: The accuracy results for the NLP services for the OCR texts.

Figure 4.13: The accuracy of the NLP services with the ground truth text.

Figure 4.14: The accuracy of the NLP services with text from the OCR output.

Figure 4.15: The average performance of the NLP services independently from the kind of text used.

Figure 4.16: The average accuracy of the NLP services based on the kind of text that was used.

When we ignore the kind of text used for the NLP services, it becomes clear from Figure 4.15 that Alchemy API is twice as good at recognizing entities (25.61% vs. 12.41%).


Furthermore, we can deduce that using ground-truth text renders the best result (22.2%); it is however comforting that the difference is only about 6.4%. Another result that is not shown here is that OpenCalais was able to recognize one out of the three defined relationships in our ground truth text.

Based on these results, we conclude that Alchemy API is a better fit for general entity extraction, although a combination of the two is also possible, as OpenCalais was able to recognize some elements (besides relationships) that Alchemy did not, and vice versa.

4.3 Client software

In this last section we will discuss some of the key elements that influence the user experience and how well our prototype application behaves. The experiments conducted were designed to test whether the prototype in its current form is able to survive in a basic real-life environment. Furthermore, for all the tests OCRopy was used as OCR-software and Alchemy API was used as NLP-service (unless otherwise mentioned). Three kinds of tests were performed:

1. Time measurements: we measured how much time it takes for the whole process of taking an image, sending it to the server, processing the response and notifying the user that content is available. We tested this for two settings: one in which a stable WiFi connection was used (to simulate reading at home or at work) and one in which an Edge connection was utilised (to simulate reading on the train or in remote areas). Although this is a pure networking test, it is important for the user experience in general.

2. Memory measurements: in the given context, it is also interesting to take a look at the memory consumption. We are fetching and showing a lot of images, but we are using devices which are limited in memory. The wearable device we used for our tests, for example, had only 1 GB of RAM available.

3. Battery consumption: lastly we will investigate the impact of the prototypeon the battery life of the device.


4.3.1 Results

The results of all our tests are the average of 5 runs to ensure that anomalies were filtered out. We will start with the network time measurements; the results are listed in Figure 4.17. We used both OCR technologies here, because the difference between them is enormous.

Figure 4.17: A comparison between the WiFi and Edge network types.

It is clear that although OCRopy performs better in accuracy (see the previous section), its impact on the user experience is far worse, regardless of the used network technology. While the whole process of taking an image, sending it and processing the response takes less than one minute with Tesseract, the same process needs roughly 5 minutes using OCRopy. When we take a closer look at the results based on the used network technology, it shows an unmistakably large flaw of mobile devices: using Edge, the average processing time (regardless of the OCR technology) is about 7 minutes. It is also important to note that this process does not include fetching the actual media; loading the images takes additional time.

Next we look at the memory consumption. As we mentioned earlier, the devices we are investigating do not have a lot of memory available: the Epson Moverio has only 1 GB of RAM, which is why we also performed the tests on the smartphone, which had 3 GB available. The tests were performed over WiFi. We tested what happens to the memory consumption when the GIF gallery is opened once the results have been processed.
From Figure 4.18a and 4.18b it is clear that the glass does not reserve a lot of memory for the app due to the shortage of memory. At the same time the memory usage doubles when the gallery activity is opened (Figure 4.18b, around second 25). The memory usage on the smartphone is comparable, but the available memory is a lot higher (Figure 4.18c).
The memory usage may not seem important, but the lack of memory on the wearable devices seriously impacts the user experience. Not only does the application run less smoothly and process more slowly, the prototype even crashes on the glass because the Android OS terminates the application due to the memory shortage.


(a) Memory consumption for the prototype on the Epson Moverio before opening the GIF gallery.

(b) Memory consumption for the prototype on the Epson Moverio while opening the GIF gallery.

(c) Memory consumption for the prototype on the smartphone before and while viewing the GIF gallery.

Figure 4.18: Memory consumption measurements for the prototype.

This memory-shortage problem can be explained by the way the glass is built: the view is always in landscape mode, and the projected screen size emphasizes this problem even more. These design choices imply that when viewing the gallery, the gallery is forced to fetch a lot more images than it would have to in portrait mode. A comparison between the landscape mode of the glass and the landscape mode of the smartphone is shown in Figure 4.19. From the same figure it is clear that even when both devices are in landscape mode, the glass will try to load far more images (36 images on the glass versus 18 on the smartphone).


(a) The GIF gallery on the smartphone.

(b) The GIF gallery on the glass.

Figure 4.19: A comparison of the number of pictures shown in landscape mode on the smartphone and on the glass.

This can of course be remedied by forcing the GIF gallery to show fewer images, but then we have to make a trade-off between showing a lot of information and saving memory.
Lastly we take a look at the battery consumption. In this scenario, we ran the prototype for 30 minutes, during which new data was available every period (which we enforced), and we opened the gallery view once and the feed once. The tests were always started with the battery fully charged to 100%. In Figure 4.20 we can see a roughly linear battery consumption.


Figure 4.20: Battery consumption over a 30 minute time period.

This means that the application can only be used for about a two-hour period before the device shuts down. This is another major flaw of the wearable devices, caused by their limited battery capacity. Based on these tests we are able to pinpoint some flaws of these devices:

• The hardware specifications are weak. This has consequences depending on the context of the application; in our case the lack of memory and a weak camera proved to be troublesome.

• Network equipment can have a negative impact on the user experience. Wearable devices are meant to be mobile, and in a context where a network connection is necessary this can be a serious flaw.

In the next chapter we will propose some possible solutions.


5 Future work

We have built a working prototype that is able to detect context and deliver extra media to enhance the user's experience. There is still some work left to be done.
The first step to take in a real-life situation is training the OCR for better accuracy results. This is also what we deem to be the next step forward: by training the OCR-software one could achieve a better detection of entities and enhance the experience even further.
Based on our tests we believe that it might also be beneficial to combine the data from the NLP-services. This can be used to detect false positives, i.e. characters and locations that are recognised but are not mentioned in the text. By combining the services, one could also detect relationships and dates, which can be important for dynamic time lines (see further).
Another possible enhancement would be to build the webservice as a more generic component. At this moment, some of the media channels are hard-coded. Our system can easily be modified in order to add media channels that are generic. For example, one could devise a REST call that allows the user to add his own media channel by providing the URL, the API keys and an example of the API response and its keywords.
Implementing these improvements could enable the implementation of a dynamic time line. We will explain this concept in the next section.


5.1 Dynamic timeline

A problem that we have not mentioned yet is the time-line problem. This problem consists of knowing when events took place and being able to place them in some chronological order. Let us demonstrate this with an example. Suppose the user is reading the first book in the Lord of the Rings series (The Fellowship of the Ring) and that we show lineage information for our entities. Then, suppose we encounter the character Sam. The current approach is a naive approach in which we look up the information about Sam and return it to the user (see Figure 5.1).

Figure 5.1: Lineage information for Sam without dynamic time line

As one can see, this kind of lineage shows a lot of information, such as the person he marries, his children, etc. But at that point in time, the user has no knowledge yet of any possible children or spouse, so this might confuse him instead of enhancing his experience.


What we suggest is to build a time line that is generated dynamically as the user progresses through the book. This can be achieved by building a list of the characters that have been encountered and the relationships that have been mentioned up to that point, and by analysing characters and dates. Let us demonstrate this once more with an example. Suppose that we have the following fictional data that we have extracted from the pages that the user has already read:

"My old man Hamfast always told me to be careful around the Tooks"

said Sam. And it was so that in the year 822 Sam had travelled

And suppose that we know that Sam married Rose in the year 853; then we could use this time line to deduce that, when showing the lineage information, there is no use in showing anything about his wife or children. We also know that he has only mentioned his father Hamfast and not his siblings or mother, so this information can also be omitted. This way we show him a lineage like the one in Figure 5.2.

Figure 5.2: Lineage information for Sam with dynamic time line
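The following sketch illustrates the idea; all names and data are taken from the example above and are purely illustrative (in a real system the lineage facts would come from a semantic data source, as discussed below).

# Sketch of the proposed dynamic time line.
encountered_characters = set(['Sam', 'Hamfast'])  # characters mentioned in the pages read so far
current_year = 822                                # latest year mentioned in the text read so far

# Lineage facts as (person, relation, relative, year) tuples.
lineage = [
    ('Sam', 'father', 'Hamfast', None),  # no year attached: shown once both characters are known
    ('Sam', 'spouse', 'Rose', 853),      # marriage year
    ('Sam', 'child', 'Elanor', 854),
]

def visible_lineage(person):
    # Only show relations the reader can already know about at this point in the book
    visible = []
    for subject, relation, relative, year in lineage:
        if subject != person:
            continue
        if year is not None and year > current_year:
            continue  # the event has not happened yet in the story
        if year is None and relative not in encountered_characters:
            continue  # the relative has not been mentioned yet
        visible.append((relation, relative))
    return visible

print(visible_lineage('Sam'))  # [('father', 'Hamfast')]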

The next problem is how we know that Sam married Rose in the year 853. This is information that we would like to extract from the semantic web or from other sources which accurately depict this kind of information. We note that at this moment a lot of information (such as the birthdays of fictional characters) is not present, and that it is cumbersome to add because of the niche-like properties of some of this information.


6 Conclusion

In this dissertation we investigated how suitable wearable devices are when used in context-sensitive areas and how we could use that context to enhance the user experience. We devised a use case where books were analysed in order to present the user with media that could enhance his experience. For this purpose, we needed to convert an image to text, extract entities from this text and look up media for these entities. The conversion of images to text was done using established open source OCR systems, while the extraction of entities was handled by NLP-services.

We tested several NLP-services and OCR-software and concluded that:

1. OCRopy outperforms Tesseract out-of-the-box in terms of accuracy

2. Tesseract is much faster than OCRopy

3. AlchemyAPI is better at extracting entities than OpenCalais

With the help of these tests we were able to find the first flaw of our test device: the camera resolution was too low to correctly perform OCR. We built a REST webservice that automates the OCR and NLP process and looks up media for the extracted entities. Furthermore, we also developed a client prototype for the Android OS to show this media to the user. We also ran tests to measure the time of the interaction between the client and the server using different kinds of network technologies. These showed that wearable devices (and mobile devices in general) are not well suited when a network connection is needed.


To solve the display real-estate issues, we used sound notifications as well as built-in techniques such as zooming and panning. We also built a feed that used built-in layout types, which allowed us to pack as much information as possible in one location.
This leads us to the third shortcoming of wearable devices: due to their weak hardware specifications they lack certain properties that enable a smooth user experience; in our case this was the memory. The lack of memory forced the client software to close when a lot of images were fetched, which happened when the gallery was shown to the user.
All these flaws can be resolved by making trade-offs: the use of an external camera or smartphone for scanning, the use of WiFi only for applications that need networking, and showing less media in order to use less memory. In short, we can conclude that wearable devices are certainly suitable for user experience enhancement, although trade-offs have to be made.



Appendices


A Setup WSGI with Apache

In this appendix we briefly explain how to set up WSGI to allow Python applications (served by Flask in our case) to communicate with the outside world. As we mentioned earlier, WSGI (Web Server Gateway Interface) acts as an interface between web servers and web apps for Python. We want our Python webservice to be available so that client software can send requests to it and get responses. We assume that an Apache webserver is used and has already been set up on a Linux machine. We will explain the steps for the Ubuntu distribution, but the steps are similar for other Linux systems.
In order to enable the communication between our Python application and Apache we can use several tools. We have chosen to work with mod_wsgi. In a terminal, enter the following command:

sudo apt-get install libapache2-mod-wsgi python-dev

This will install the necessary modules and software. Then activate the module as follows:

sudo a2enmod wsgi

In the next step, we assume that:

1. Flask is already installed

2. there is already a Flask application called app created in the /var/www/ folder

We then need to create a new virtual host for Apache:

sudo nano /etc/apache2/sites-available/app.conf

This file needs to contain information about the app, such as where it is located. An example of this file could be:

<VirtualHost *:80>

ServerName localhost

ServerAdmin admin@localhost

WSGIScriptAlias / /var/www/app/flaskapp.wsgi

<Directory /var/www/app/app/>

Order allow,deny

Allow from all

</Directory>

Alias /static /var/www/app/app/static

<Directory /var/www/app/app/static/>

Order allow,deny

Allow from all

</Directory>

</VirtualHost>

This new virtual host needs to be enabled:

sudo a2ensite app

Lastly we need to add a WSGI file; this needs to happen in the folder where the Flask application is located (so in our case /var/www/app/):

sudo touch flaskapp.wsgi

This file needs to contain the path to our application in the system path:

#!/usr/bin/python

import sys

sys.path.insert(0,"/var/www/app/")

from app import app as application


B Shell scripts

Our shell scripts to perform OCR using OCRopy and Tesseract have been added in this appendix. The shell scripts are quite similar and could easily be merged into one; we have kept them separate for didactic reasons.

Listing B.1: Shell script for OCROpus

#!/bin/bash -e

wd="/var/www/webservice/webservice/images/"
tempFolder="temp"
imageName="image"
extension=".jpg"
outputFile="result.html"
outputFolder="../results/"

while getopts ":w:o:f:i:e:t:" opt; do
    case $opt in
        w)
            wd=$OPTARG
            ;;
        o)
            outputFile=$OPTARG
            ;;
        f)
            outputFolder=$OPTARG
            ;;
        i)
            imageName=$OPTARG
            ;;
        e)
            extension=$OPTARG
            ;;
        t)
            tempFolder=$OPTARG
            ;;
        :)
            echo "Option -$OPTARG requires an argument." >&2
            exit 1
            ;;
        \?)
            echo "Invalid option: -$OPTARG" >&2
            ;;
    esac
done

# Calculate the size of the provided image
function calculate_height_width ()
{
    height_width=`convert $imageName$extension -format "%h$%w" info:`
    if [[ "$1" = "height" ]]; then
        height=`echo $height_width | sed 's/\([0-9]*\)\$\([0-9]*\)/\1/'`
    else
        width=`echo $height_width | sed 's/\([0-9]*\)\$\([0-9]*\)/\2/'`
    fi
}

cd $wd
rm -rf $tempFolder

calculate_height_width "height"
calculate_height_width "width"

# Upscale the image until the minimum resolution is reached
while [[ $height -lt 600 || $width -lt 600 ]]; do
    convert $imageName$extension -resize 150% $imageName$extension
    calculate_height_width "height"
    calculate_height_width "width"
done

# Binarisation
ocropus-nlbin -n $imageName$extension -o $tempFolder
if [ $? -eq 0 ]; then
    # Page/line segmentation
    ocropus-gpageseg -n --minscale 5 "$tempFolder/????.bin.png"
else
    echo "Something went wrong, exiting"
    exit 2
fi

if [ $? -eq 0 ]; then
    # Character recognition
    ocropus-rpred -n "$tempFolder/????/??????.bin.png"
else
    echo "Something went wrong, exiting"
    exit 3
fi

if [ $? -eq 0 ]; then
    # Combine the recognised lines into one hOCR file
    ocropus-hocr "$tempFolder/????.bin.png" -o $outputFolder$outputFile
    echo "All good"
else
    echo "Something went wrong, exiting"
    exit 4
fi

Listing B.2: Shell script for Tesseract

#!/bin/bash -e

wd="/var/www/webservice/webservice/images/"
tempFolder="temp"
imageName="image"
extension=".jpg"
outputFile="result"
outputFolder="../results/"

while getopts ":w:o:f:i:e:t:" opt; do
    case $opt in
        w)
            wd=$OPTARG
            ;;
        o)
            outputFile=$OPTARG
            ;;
        f)
            outputFolder=$OPTARG
            ;;
        i)
            imageName=$OPTARG
            ;;
        e)
            extension=$OPTARG
            ;;
        t)
            tempFolder=$OPTARG
            ;;
        :)
            echo "Option -$OPTARG requires an argument." >&2
            exit 1
            ;;
        \?)
            echo "Invalid option: -$OPTARG" >&2
            ;;
    esac
done

# Earlier variant that first binarised the image with OCRopus (kept for reference):
# cd $wd
# rm -rf temp
# ocropus-nlbin -n $imageName$extension -o $tempFolder
# if [ $? -eq 0 ]; then
#     tesseract "$tempFolder/????.bin.png" $outputFile hocr
# else
#     echo "Something went wrong, exiting"
#     exit 1
# fi

# Calculate the size of the provided image
function calculate_height_width ()
{
    height_width=`convert $imageName$extension -format "%h$%w" info:`
    if [[ "$1" = "height" ]]; then
        height=`echo $height_width | sed 's/\([0-9]*\)\$\([0-9]*\)/\1/'`
    else
        width=`echo $height_width | sed 's/\([0-9]*\)\$\([0-9]*\)/\2/'`
    fi
}

cd $wd

calculate_height_width "height"
calculate_height_width "width"

# Upscale the image until the minimum resolution is reached
while [[ $height -lt 600 || $width -lt 600 ]]; do
    convert $imageName$extension -resize 150% $imageName$extension
    calculate_height_width "height"
    calculate_height_width "width"
done

# Tesseract processes the image internally and writes hOCR output
tesseract $imageName$extension $outputFile hocr
if [ $? -eq 0 ]; then
    ext=".hocr"
    newExt=".html"
    mv $outputFile$ext $outputFolder$outputFile$newExt
    echo "All good"
    exit 0
else
    echo "Something went wrong, exiting"
    exit 2
fi
