Hidden Markov Model based gesture recognition on low-cost, low-power Tangible User Interfaces


Entertainment Computing 1 (2009) 75–84


Piero Zappi *, Bojan Milosevic, Elisabetta Farella, Luca Benini

DEIS, University of Bologna, V.le Risorgimento 2, 40136 Bologna, Italy


Article history: Received 10 June 2009; Revised 14 September 2009; Accepted 27 September 2009.

Keywords: Hidden Markov Models; Smart object; Tangible interfaces; Gesture recognition; Fixed point; Multiple users.

1875-9521/$ - see front matter © 2009 Published by Elsevier B.V.
doi:10.1016/j.entcom.2009.09.005

* Corresponding author. Tel.: +39 051 209 3835.
E-mail addresses: [email protected] (P. Zappi), [email protected] (B. Milosevic), [email protected] (E. Farella), [email protected] (L. Benini).

Abstract

The development of new human–computer interaction technologies that go beyond the traditional mouse and keyboard is gaining momentum as smart interactive spaces and virtual reality become part of our everyday life. Tangible User Interfaces (TUIs) introduce physical objects that people can manipulate to interact with smart spaces. Smart objects used as TUIs can further improve the user experience by recognizing natural gestures and coupling them to commands issued to the computing system. Hidden Markov Models (HMMs) are a typical approach to recognizing gestures. In this paper, we show how the HMM forward algorithm can be adapted for use on low-power, low-cost microcontrollers without a floating point unit, such as those that can be embedded into many TUIs. The proposed solution is validated on a set of gestures performed with the Smart Micrel Cube (SMCube), a TUI developed within the TANGerINE framework. Throughout the paper we evaluate the complexity of the algorithm and the recognition performance as a function of the number of bits used to represent data. Furthermore, we explore a multiuser scenario where up to four people share the same cube. Results show that the proposed solution performs comparably to the standard forward algorithm run on a PC with double-precision floating point calculations.

© 2009 Published by Elsevier B.V.

1. Introduction

Traditional user interfaces define a set of graphical elements (e.g., windows, icons, menus) that reside in a purely electronic or virtual form. Generic input devices like mouse and keyboard are used to manipulate these virtual interface elements. Although these interaction devices are useful and even usable for certain types of applications, such as office duties, a broad class of scenarios foresees more immersive environments where the user interacts with the surroundings by manipulating the objects around him/her.

Tangible User Interfaces (TUIs) introduce physical, tangible objects that augment the real physical world by coupling digital information to everyday objects. The system interprets these devices as part of the interaction language. TUIs become the representatives of the user navigating in the environment and enable the exploitation of digital information directly with his/her hands. People, manipulating those devices and inspired by their physical affordances, can have more direct access to the functions mapped to different objects.


The effectiveness of a TUI can be enhanced if we use sensor-augmented devices. Such smart objects may be able to recognize user gestures and improve the human experience within interactive spaces. Furthermore, the opportunity to execute a gesture recognition algorithm on-board brings several advantages:

1. The stream of sensor readings is not sent over the wireless channel. This reduces radio use and extends object battery life.

2. The reduced wireless channel usage allows the coexistence of a larger number of objects in the same area.

3. Each object operates independently and in parallel with the others, improving system scalability.

4. The handling of objects moving between different physical environments is facilitated.

5. No other systems, such as video cameras, are required to detect and classify user movements, thus system cost is reduced.

The SMCube is a tangible interface developed as a building block of the TANGerINE framework, a tangible tabletop environment where users manipulate smart objects in order to perform actions on the contents of a digital media table [5]. The SMCube is a cubic case with a 6.5 cm edge. At the current stage of development, it is equipped with sensors (a digital tri-axial accelerometer by default) and actuators (infrared LEDs, vibromotors). Data from the accelerometer are used to locally detect the active face (the one directed upwards) and a set of gestures performed by the user (cube placed on the table, cube held, cube shaken, and tap [8]). This information is sent wirelessly to the base station that controls the appearance and the elements of the virtual scenario projected on the digital media table. Furthermore, through the LEDs the node can interact with a vision-based system in a multi-modal activity detection scenario.

The recognition algorithms developed in previous work rely on time invariant features extracted from the acceleration stream. In the more general case, the information used to recognize a gesture is contained in the sequence of accelerations rather than in some particular features. For this class of problems a different family of algorithms has been developed. Among others, Hidden Markov Models (HMMs) and their variants have been used extensively for gesture recognition.

HMMs belong to the class of supervised classifiers, thus they require an initial training phase to tune their parameters prior to normal operation. Even if the training of an HMM is a complex task, classification is performed using a recursive procedure called the forward algorithm. Although this process is a lightweight task compared to training, several issues must be considered in order to implement it on a low-power, low-cost microcontroller such as the one embedded in the SMCube.

In this paper, we present our fixed point implementation of the HMM forward algorithm. Throughout the paper we highlight the issues related to the implementation of this algorithm on devices with low computational power and little memory that cannot rely on floating point arithmetic and, starting from the analysis of the standard floating point implementation of the algorithm, we propose a solution to these issues.

We evaluate how the use of fixed point variables with limited accuracy (8, 16, and 32 bits) impacts system performance and whether our on-board solution can replace the analysis of data streams performed through the standard algorithm executed on a PC.

To characterize the performance of the algorithm we collected a dataset where four users perform a set of four gestures (drawing a circle, a square or an X, and flipping a page) while holding the SMCube. The selected gestures are not related to a particular application but can be associated with general meanings (e.g., the X can represent a delete command, and the flip a change of the application background). The recognition performance of the fixed point implementations using different data accuracies has been compared to that of the standard implementation carried out using double precision, considered as the target performance. Two scenarios are considered: the cube is used by a single person, and up to four people share the same object.

Furthermore, we characterize the computational and memory cost of the algorithm to evaluate the scalability of our approach in terms of the number of gestures that can be detected by a single cube.

The rest of the paper is organized as follows. The next section presents an overview of the state of the art for TUIs. Section 3 introduces the architecture of the SMCube, while Section 4 focuses on the proposed activity recognition chain and includes an overview of HMMs, the forward algorithm and the data preprocessing step. This section highlights the critical steps for a fixed point implementation. Our fixed point solution is presented in Section 5. Section 6 presents the dataset used for testing. We then characterize the implementation (Section 7) and discuss the results. Section 8 concludes the paper.

2. Related work

2.1. Tangible User Interfaces (TUIs)

Almost two decades ago, research began to look beyond the current paradigms of human–computer interaction based on computers with a display, a mouse and a keyboard, in the direction of more natural ways of interaction [44]. Since then, concepts such as wearable computing [22] and tangible interfaces [15] have been developed.

The use of TUIs has been proposed in many scenarios where users manipulate virtual environments. This has proved to be useful especially in applications for entertainment and education [14]. An analysis of the impact of TUIs within a school scenario is presented in Ref. [30]. According to this work, research from psychology and education suggests that there can be real benefits for learning from tangible interfaces. An early study on different interaction technologies including TUIs is presented in Ref. [33]. In this study the authors highlight how graspable interfaces encourage collaborative work and multiple hand interaction. In Ref. [42] the authors developed an educational puzzle game and compared two interfaces: a physical one based on TUIs and a screen-based one. Results show that TUIs are an easier means to complete the assigned tasks and have higher acceptance among the 25 children between 5 and 7 years old involved in the test.

TUIs can enhance the exploration of virtual environments. Virtual heritage (VH) applications aim at making cultural wealth accessible to the worldwide public. Advanced VH applications exploit virtual reality (VR) technology to give users immersive experiences, such as archaeological site navigation, time and space travel, and ancient artifact reconstruction in 3D. Navigation through such virtual environments can benefit from the presence of tangible artifacts like palmtop computers [11] or control objects [13]. Ref. [31] presents the Tangible Moyangsung, a tangible environment designed for a group of users that can play fortification games. People, by manipulating tangible blocks, can navigate the virtual environment, solve puzzles or repair damaged virtual walls in an evocation of historical facts.

Interactive surfaces are a natural choice when developing applications that deal with browsing and exploration of multimedia contents. On these surfaces users can manipulate elements through direct and spontaneous actions. For example, in Ref. [4] multiple users can collaborate within an interactive workspace featuring vision-based gesture recognition to perform knowledge building activities such as brainstorming. On the reacTable [17] several musicians can share the control of the instrument by caressing, rotating and moving physical artifacts with dedicated functions on the table surface. TViews is an LCD-based framework where users can interact with displayed contents through a set of TUIs (pucks) [24]. A puck is used to select and move virtual objects, and its position is tracked using acoustic and infrared technologies. Another example is the Microsoft Surface Computing platform [26], where multiple users can share and manipulate digital contents on a multi-touch surface.

The expressiveness of TUIs can be enhanced by the use of smart objects. The MusicCube is a tangible interface used to play digital music like an mp3 player [7]. The cube is able to detect the face pointing upwards and a set of simple gestures. This ability, together with a set of controls and buttons, is used to choose the desired playlist and to control the music volume. The display cube is a cube-shaped TUI equipped with a three-axis accelerometer, 6 LCD displays (one per face) and a speaker, used as a learning appliance [23]. SmartPuck is a cylindrical multi-modal input–output device with an actuated wheel, a 4-way button, LEDs and a speaker, used on a plasma display panel [20]. SmartPuck allows multiple users to interact with menus and applications and has been tested by using it to navigate the Google Earth program in place of a mouse and desktop. Some commercial devices follow this trend: the Wiimote™ is a controller developed by Nintendo for its Wii™ console [28]. This controller embeds an accelerometer, an infrared camera and a Bluetooth transceiver and is used to interact with a large number of applications and videogames.

Fig. 1. The TANGerINE SMCube. The cube edge is 6.5 cm long. On the top left, the inner surface of the master face, which includes the microcontroller, the accelerometer and the transceiver. On the top right, the inner surface of the other five faces of the cube.


2.2. Gesture recognition

Gesture recognition algorithms are typically made up of four steps: data acquisition from the sensors, data preprocessing to reduce noise, extraction of relevant features from the data stream, and classification. Several design choices are available at each step, depending on the application scenario, the activities that have to be recognized, and the available computational power.

When features are time invariant (e.g., zero crossing rate or frequency spectrum), simple time-independent classifiers can be used (e.g., linear classifiers, such as Support Vector Machines, or decision trees, such as C4.5). In the more general case, features are time dependent and classifiers suited for temporal pattern recognition are used. Typical approaches include dynamic time warping [21], neural networks [3], conditional random fields [25] and Hidden Markov Models (HMMs) [45].

Even if several classification algorithms have been proposed for implementation on smart objects [35,34,16], the solutions proposed to recognize gestures performed with TUIs typically rely on vision systems [40,18,36], on the collection and processing of data on an external PC [2,29,43], or on the recognition of simple gestures through the analysis of time invariant features [46,7,9].

An exception to the previously cited papers is the work proposed by Ueda et al. [41]. In this work the authors present the m-ActiveCube, a physical cube equipped with sensors (ultrasonic, tactile and gyroscopes) and actuators (buzzer, LEDs and motors) that acts as a bidirectional user interface toward a 3D virtual space. Multiple cubes can be connected and collaborate in achieving a defined task. In this paper the authors evaluate a fixed point implementation of an HMM able to perform speech recognition. Since the proposed algorithm cannot be implemented on a single cube, the basic idea is to balance the computation among several cubes. One of the main limits of this work is that the critical issue of data synchronization among the different cubes that participate in the computation is not considered. Furthermore, the authors assume that all the cubes always participate in the speech recognition, so every node of the network is a point of failure for the whole system. Finally, the recognition ratio of the proposed algorithm is not reported, so the performance of this solution cannot be assessed.

In contrast to the work presented in Ref. [41], here we present an algorithm able to recognize complex gestures that can be implemented on a single cube with much lower computational power and memory than those available on the m-ActiveCubes. As a consequence, in our solution each cube is independent of the others, hence (1) it does not need any synchronization, (2) it is not a point of failure for the whole system, (3) multiple users can operate on the tabletop at the same time, and (4) the need for wireless communication is reduced (only indications of gestures are sent), resulting in longer battery life and reduced interference.

To the authors' best knowledge, no previous work has fully embedded such an activity recognition algorithm on low-power, low-cost 8-bit microcontrollers.

Fig. 2. SMCube LED patterns.

3. Smart Micrel Cube

The Smart Micrel Cube (SMCube) is a cube-shaped artifact with a matrix of infrared emitter LEDs on each face (see Fig. 1). It embeds a low-cost, low-power 8-bit microcontroller (Atmel ATmega168 [1]), a Bluetooth transceiver (Bluegiga WT12 Bluetooth module [6]) that supports the Serial Port Profile (SPP) and a MEMS tri-axial accelerometer (STM LIS3LV02DQ [39]) with a programmable full scale of 2g or 6g and digital output. The cube is powered through a 1000 mAh, 4.2 V Li-ion battery. With this battery the cube reaches up to 10 h of autonomy during normal operation.

The ATmega168 features a RISC architecture that can operate at up to 20 MHz and offers 16 KB of Flash memory, 1 KB of RAM and 512 bytes of EEPROM. The microcontroller includes a multiplier and several peripherals (ADC, timers, SPI and UART serial interfaces, etc.). In our prototype the CPU operates at 8 MHz. The firmware has been implemented in C using the Atmel AVR Studio 4 IDE which, used in conjunction with avr-libc and the gcc compiler, provides all the APIs necessary to exploit the peripherals and perform operations with 8, 16, and 32 bit variables. Being written in C, the code is portable to other devices. A wireless bootloader is used to load new firmware onto the cube without the need to disassemble it.

Each cube is identified by an ID number stored in the cube flash memory, which helps disambiguate cubes when more than one is present on the scene at the same time.

The LED pattern on every face of the cube is composed of 8 points (see Fig. 2). In the basic configuration only points p1, p2, p3, p5 are switched on; the remaining points are used as a binary encoding of the cube ID (note that this visual ID is not related to the cube ID stored in the MCU flash).

In addition to the LEDs, actuation can be provided through six vibromotors mounted on the cube faces.

The choice of the SMCube for this work is motivated by the fact that it is representative of a wide set of smart objects, since it embeds low-power, low-cost hardware that can be integrated in a large number of devices. Therefore, the algorithm presented in this paper can be used in a wide range of TUIs.

4. Activity recognition chain

Typical activity recognition systems are made up of four steps: (1) data preprocessing, (2) segmentation, (3) feature extraction, (4) classification. At each step several design choices must be made.

In this work, we do not address the problem of segmentation and we deal with isolated gestures. Other works in the literature propose solutions to this task. For example, different approaches for time series segmentation are presented in Ref. [19], an optimized approach for low-power wearable systems can be found in Ref. [38], and an HMM based algorithm for hand gesture segmentation is proposed in Ref. [12].

In the following sections, we provide details on the classification, data preprocessing and feature extraction steps.

4.1. Hidden Markov Models

HMMs are often used in activity recognition since they tend to perform well with a wide range of sensor modalities [37] (they are also used successfully in other problem domains, such as speech recognition, for which they were initially developed [27]).

Several variants of HMMs have been proposed to address shortcomings of the traditional algorithm. For example, specific state durations or null state transitions can be defined to better model speech signals [32], coupled HMMs (CHMMs) have been designed to better characterize multiple interdependent sequences [47], and factorial HMMs (FHMMs) and parallel HMMs (PHMMs) have been developed to combine multiple features in an efficient way, but at different levels of abstraction (FHMM at feature level, PHMM at decision level) [10]. Therefore, the selection of the best model depends on the application and the set of gestures that we want to recognize.

Since in this work we want to evaluate the consequences, in terms of performance loss, of implementing an algorithm from the HMM family on low-power hardware without a floating point unit, we do not analyze specific variants and we investigate the basic, standard, ergodic (fully connected) version of HMM.

A Hidden Markov Model (HMM) is a statistical model that can be used to describe a physical phenomenon through a set of stochastic processes. It is composed of a finite set of $N$ states ($q_t = s_i$ with $1 \le i \le N$). At every time step the state of the system changes. Transitions among the states are governed by a set of transition probabilities, $A = \{a_{ij}\} = \{P(q_{t+1} = s_j \mid q_t = s_i)\}$ (the probability that the system is in state $s_i$ at time $t$ and in state $s_j$ at time $t+1$). At every time step an outcome $o_t$ is generated according to the associated observation probabilities, $B = \{b_i\} = \{P(o \mid q = s_i)\}$ (the probability of observing symbol $o$ while the system is in state $s_i$). An ergodic HMM has $a_{ij} \neq 0 \; \forall i,j$. Outcomes can belong either to a continuous domain (in which case the $b_i$ are probability density functions) or to a discrete domain (in which case we have $M$ symbols, $b_i(k)$ with $1 \le k \le M$). In the former case we deal with continuous HMMs, in the latter with discrete HMMs. Only the outcome, not the state, is visible to an external observer, and therefore the states are "hidden" from the outside. Furthermore, a model is defined by the starting probabilities $\Pi = \{\pi_i\} = \{P(q_1 = s_i)\}$. The compact notation for an HMM is $\lambda = (A, B, \Pi)$.
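To make the data layout concrete, the following is a minimal sketch of how such a discrete model $\lambda = (A, B, \Pi)$ could be laid out in C on this class of devices. The type and macro names (hmm_t, q016_t, N_STATES, M_SYMBOLS) and the Q0.16 fixed point encoding are our own illustrative assumptions, not details taken from the paper.

    #include <stdint.h>

    #define N_STATES  10  /* N: number of hidden states           */
    #define M_SYMBOLS  3  /* M: number of discrete output symbols */

    /* Probabilities stored as unsigned Q0.16 fixed point:
       value = raw / 65536, so 0xFFFF is just below 1.0. */
    typedef uint16_t q016_t;

    typedef struct {
        q016_t A[N_STATES][N_STATES];  /* a_ij = P(q_{t+1}=s_j | q_t=s_i) */
        q016_t B[N_STATES][M_SYMBOLS]; /* b_i(k) = P(o=k | q=s_i)         */
        q016_t pi[N_STATES];           /* pi_i = P(q_1=s_i)               */
    } hmm_t;

Note that such a layout stores exactly $N^2 + N \cdot M + N$ entries per model, the count that appears in the memory cost formula of Section 5.1.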

Training of an HMM is carried out using an iterative algorithm, the Baum–Welch (or Expectation–Maximization) algorithm, and a set of reference instances. Once a set of models has been trained (one for each class that we want to recognize), classification of new instances is performed using the forward algorithm.

In the following, in order to clarify the notation used, we report only the forward algorithm.

4.1.1. The forward algorithm

The forward algorithm is a recursive algorithm that relies on a set of support variables $\alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t = s_i \mid \lambda)$ and allows finding the probability $P(O \mid \lambda)$ that a certain model generated an input sequence. It is made up of three steps:

1. Initialization: $\alpha_1(i) = \pi_i \, b_i(o_1)$, for $1 \le i \le N$.

2. Induction: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \right] b_j(o_{t+1})$, for $1 \le j \le N$ and $1 \le t \le T-1$.

3. Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$.
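As a reference point, the induction step translates almost directly into code. The sketch below is a plain double-precision version (the function name and flat row-major array layout are our own); without the scaling discussed next, it underflows for long sequences.

    /* One induction step of the forward algorithm:
       alpha_next[j] = (sum_i alpha[i] * A[i][j]) * B[j][o_next].
       A is N x N and B is N x M, both stored row-major. */
    void forward_step(int N, int M, const double *A, const double *B,
                      const double *alpha, double *alpha_next, int o_next)
    {
        for (int j = 0; j < N; j++) {
            double acc = 0.0;
            for (int i = 0; i < N; i++)
                acc += alpha[i] * A[i * N + j];      /* a_ij            */
            alpha_next[j] = acc * B[j * M + o_next]; /* b_j(o_{t+1})    */
        }
    }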

The $\alpha_t(j)$ are sums of a large number of terms of the form

$$\left( \prod_{r=1}^{t-1} a_{s(r),s(r+1)} \right) \left( \prod_{r=1}^{t} b_{s(r)}(o_r) \right) \qquad (1)$$

Since both the $a_{ij}$ and the $b_i(k)$ are smaller than 1, as $t$ becomes large $\alpha_t(j)$ tends to zero exponentially and soon exceeds the precision of any machine.

In order to avoid underflow, the $\alpha_t(j)$ are normalized at every step using the scaling factor $c_t = 1 / \sum_{i=1}^{N} \alpha_t(i)$. The scaled $\hat{\alpha}_t(j) = \alpha_t(j) \, c_t$ are used in place of the $\alpha_t(j)$.

When using normalization we cannot simply sum the $\hat{\alpha}_T(i)$ in the termination step, since their sum is equal to 1. However, we can note the following [32]:

$$\sum_{i=1}^{N} \hat{\alpha}_T(i) = \prod_{t=1}^{T} c_t \sum_{i=1}^{N} \alpha_T(i) = C_T \sum_{i=1}^{N} \alpha_T(i) = 1 \qquad (2)$$

$$\prod_{t=1}^{T} c_t \; P(O \mid \lambda) = 1 \qquad (3)$$

$$P(O \mid \lambda) = \frac{1}{\prod_{t=1}^{T} c_t} \qquad (4)$$

$$\log[P(O \mid \lambda)] = -\sum_{t=1}^{T} \log[c_t] \qquad (5)$$

The use of the logarithm in the last step is necessary in order to avoid underflow, since $P(O \mid \lambda) = 1 / \prod_{t=1}^{T} c_t$ tends to zero exponentially.

Note that no constraint is placed on the length of the sequence to classify.

4.2. Data preprocessing and feature extraction

The input sequence of acceleration triplets should be preprocessed to highlight important signal properties while reducing input dimensionality. In particular, we are interested in using discrete features, since discrete HMMs are much less computationally demanding than HMMs operating on continuous observations.

Furthermore, we must note that when we perform a gesture using the cube, its initial orientation strongly affects the sequences associated with the gesture. For example, if we consider the trivial gesture where a user lifts the cube and puts it down, according to which face is directed upwards we can have six different sequences out of the accelerometer (two for each axis). Moreover, the design space further increases when considering more complex gestures, such as the ones presented in Section 6.

For these reasons, we decided to use only the magnitude of the acceleration, calculated as the sum of the squares of the accelerations along the three axes. The final square root is not computed. This results in reduced computational cost, since the number of streams to process diminishes from three (one for each axis) to one, and in sequences independent of the initial orientation, i.e., they do not depend on how the user holds the cube.

In order to use a discrete model, we rely on discrete feature symbols that indicate the acceleration direction (e.g., negative acceleration, positive acceleration, and no acceleration). The conversion of a magnitude sample $a_t$ into a feature symbol $f_t$ is done by means of one threshold $\Delta R$ as follows:

$$f_t = \begin{cases} - & \text{for } a_t < R - \Delta R \\ 0 & \text{for } R - \Delta R \le a_t \le R + \Delta R \\ + & \text{for } a_t > R + \Delta R \end{cases} \qquad (6)$$

where $R$ is the acceleration magnitude when no movements are performed ($1g$).
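Since the square root is skipped (see Section 7.2), the comparison can be done directly on the squared magnitude against pre-squared thresholds. The sketch below illustrates this; the two threshold constants are placeholders in raw accelerometer LSB units, since the paper does not give their numeric values.

    #include <stdint.h>

    /* Placeholders for (R - DR)^2 and (R + DR)^2 expressed in raw
       accelerometer units; actual values depend on the sensor scale. */
    #define TH_LOW_SQ   ((uint32_t)580000UL)
    #define TH_HIGH_SQ  ((uint32_t)1620000UL)

    typedef enum { SYM_MINUS, SYM_ZERO, SYM_PLUS } symbol_t;

    /* 3 multiplications + 2 sums for the squared magnitude,
       2 compares for the ternarization; no square root, no division. */
    symbol_t preprocess(int16_t ax, int16_t ay, int16_t az)
    {
        uint32_t m2 = (uint32_t)((int32_t)ax * ax)
                    + (uint32_t)((int32_t)ay * ay)
                    + (uint32_t)((int32_t)az * az);

        if (m2 < TH_LOW_SQ)  return SYM_MINUS;
        if (m2 > TH_HIGH_SQ) return SYM_PLUS;
        return SYM_ZERO;
    }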

Table 1
Basic operations computational complexity.

Operation                 Cost
Shift                     1
Variable compare          1
Sum 8 bits                1
Sum 16 bits               4
Sum 32 bits               8
Multiplication 8 bits     4
Multiplication 16 bits    15
Multiplication 32 bits    35

5. Fixed point solution

The low-power microcontroller embedded in the SMCube includes a multiplier but not a divider. Therefore, it can efficiently compute the steps required by the forward algorithm for discrete HMMs, but not those of the standard normalization procedure. In fact, as shown in the previous section, that procedure requires N divisions each time a new sample is processed.

To find a solution suitable for our MCU, we must notice that the objective of the normalization procedure is simply to keep the $\hat{\alpha}_t(j)$ within the range of the machine. Thus, it is not necessary that $\sum_{i=1}^{N} \hat{\alpha}_t(i) = 1$. We propose an alternative scaling procedure:

1. at each time step $t$, once the $\alpha_t(i)$ have been computed, check whether the highest $\alpha_t(i)$ is smaller than 1/2; otherwise scaling is not needed at this step;

2. calculate the number of left shifts $l_t$ needed to make the highest $\alpha_t(i)$ greater than 1/2;

3. shift all $\alpha_t(i)$ to the left by $l_t$ bits.

If, at a certain time $t$, all the $\alpha_t(i)$ are equal to zero, they are all set to 1/2 and $l_t$ is set equal to the number of bits with which we represent our data (data size). This procedure requires only shifts and can be efficiently implemented on the low-power microcontroller embedded in the SMCube.
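A possible C rendering of this scaling step is sketched below, reusing the hypothetical Q0.16 layout from Section 4.1; HALF is the fixed point encoding of 1/2 and the returned value is $l_t$.

    #include <stdint.h>

    #define N_STATES  10
    #define DATA_BITS 16
    #define HALF      ((uint16_t)1u << (DATA_BITS - 1))

    /* Scales alpha[] in place using left shifts only; returns l_t. */
    uint8_t scale_alphas(uint16_t alpha[N_STATES])
    {
        uint16_t max = 0;
        uint8_t lt = 0;

        for (int i = 0; i < N_STATES; i++)   /* find the highest alpha    */
            if (alpha[i] > max) max = alpha[i];

        if (max >= HALF) return 0;           /* already >= 1/2: no scaling */

        if (max == 0) {                      /* degenerate all-zero case   */
            for (int i = 0; i < N_STATES; i++) alpha[i] = HALF;
            return DATA_BITS;
        }

        while (max < HALF) { max <<= 1; lt++; }  /* count the shifts l_t  */

        for (int i = 0; i < N_STATES; i++)
            alpha[i] = (uint16_t)(alpha[i] << lt);
        return lt;
    }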

Another problem arises when we need to compute the logarithm of the $c_t$ (see Eq. (5)). However, the proposed scaling procedure eases this task. In this case the final probability is given by $\log P(O \mid \lambda) = \log(r) - \sum_{t=1}^{T} \log 2^{l_t}$, where $r = \sum_{i=1}^{N} \hat{\alpha}_T(i) \neq 1$. By using base-2 logarithms we already have the value of $\sum_{t=1}^{T} \log_2 2^{l_t} = \sum_{t=1}^{T} l_t$, obtained simply by keeping track of how many shifts we performed for scaling. Furthermore, we do not need to compute $\log(r)$, since the logarithm is a monotonically increasing function. Thus, to compare two models, we simply check for the one that required fewer shifts for scaling; in case of a tie, the one with higher $r$ is the model with higher $P(O \mid \lambda)$.
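The comparison rule therefore needs no logarithm or division at all; a minimal sketch (the accumulation of the shift totals and of the residual $r$ is assumed to happen elsewhere):

    #include <stdint.h>

    /* Returns nonzero if model a is more likely than model b:
       fewer accumulated shifts wins; ties are broken by the larger
       residual r = sum of the final scaled alphas. */
    int model_a_wins(uint32_t shifts_a, uint32_t r_a,
                     uint32_t shifts_b, uint32_t r_b)
    {
        if (shifts_a != shifts_b)
            return shifts_a < shifts_b;
        return r_a > r_b;
    }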


Table 2
Algorithm complexity.

Algorithm                    Cost
$\alpha_{t+1}$ calculation   $(N^2 + N)$ mul $+ \; 2(N^2 - 1)$ sum
Normalization                $(N - 1) + (N + 1) \cdot (\text{data size} - 1)$
Single step (8 bit)          $C \cdot [6N^2 + 12N + 4]$
Single step (16 bit)         $C \cdot [23N^2 + 31N + 7]$
Single step (32 bit)         $C \cdot [51N^2 + 67N + 14]$

5.1. Forward algorithm complexity

Classification of a new instance using HMMs is performed by computing, through the forward algorithm, the probability that the input sequence was generated by each model associated with a gesture. The instance is classified as belonging to the class of the model that yields the highest probability.

Therefore, once the beginning of a new gesture has been detected, each time the MCU samples new data from the accelerometer it must preprocess the input data, execute one step of the forward algorithm for all models, and normalize the $\alpha_t(i)$ of all the models.
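Organized as code, the per-sample work could look like the following sketch, which ties together the hypothetical fragments shown earlier (fixed_forward_step, a fixed point counterpart of the induction step, is assumed but not shown; N_MODELS stands for the number of gesture classes C):

    #include <stdint.h>

    #define N_MODELS 4  /* C: one HMM per gesture */

    /* From the earlier sketches: hmm_t, symbol_t, preprocess(),
       scale_alphas(). fixed_forward_step() is assumed here. */
    void fixed_forward_step(const hmm_t *m, uint16_t *alpha, symbol_t o);

    extern hmm_t    models[N_MODELS];
    extern uint16_t alpha[N_MODELS][N_STATES];
    extern uint32_t shift_total[N_MODELS];

    /* Called for every new accelerometer sample during a gesture. */
    void on_sample(int16_t ax, int16_t ay, int16_t az)
    {
        symbol_t o = preprocess(ax, ay, az);             /* Section 4.2 */
        for (int m = 0; m < N_MODELS; m++) {
            fixed_forward_step(&models[m], alpha[m], o); /* induction   */
            shift_total[m] += scale_alphas(alpha[m]);    /* scaling l_t */
        }
    }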

According to the standard algorithm presented above, one step of the forward algorithm (i.e., calculating $\alpha_{t+1}(i)$ for $1 \le i \le N$) requires:

1. the product between an $N \times N$ matrix (the transition probabilities matrix $A$) and the old $N \times 1$ vector of the $\alpha_t$ ($N^2$ multiplications and $(N - 1)$ sums);

2. an element-by-element product of the resulting vector with the column of the observation probabilities matrix $B$ associated with the output $o_{t+1}$ ($N$ multiplications and $(N - 1)$ sums).

The scaling algorithm first finds the highest $\alpha_t(i)$, then computes the number of shifts needed and finally shifts the other $\alpha_t(i)$. To execute this procedure, in the worst case, we need $N - 1$ variable comparisons to find the highest value and $\text{data size} - 1$ comparisons to find the number of shifts. Finally, we perform $N \cdot (\text{data size} - 1)$ shifts.

In Table 1, we present the computational cost of the basic operations used to evaluate the complexity of our implementation. A summary of the complexity of the steps outlined above is presented in Table 2 (where C is the number of gestures we want to recognize).

The memory cost is given by $\frac{\text{data size}}{8} \cdot C \cdot (N^2 + N \cdot M + N)$ bytes. The models can be stored either in the MCU RAM, Flash or EEPROM.
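As a worked check against Table 7: with $C = 4$ gestures, $M = 3$ symbols and $N = 10$ states the formula gives $4 \cdot (100 + 30 + 10) = 560$ entries, i.e. 560 bytes with 8 bit variables, 1120 bytes with 16 bits and 2240 bytes with 32 bits; with $N = 7$ it gives $4 \cdot (49 + 21 + 7) = 308$ entries, i.e. 308, 616 and 1232 bytes, respectively.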

6. Experimental setup

The algorithm is validated against a set of four gestures performed by four people working in our laboratory. Each tester performed 60 repetitions of each gesture, for a total of 1200 instances. Even if all the performers are people in the field of computer engineering, they were asked to perform the gestures without any particular training except a single initial visual demonstration.

The four gestures selected for our tests are shown in Table 3. For the purpose of evaluating our fixed point implementation with respect to the standard one, the chosen set of gestures is not crucial. Therefore, we selected a set of movements representative of those used to interact with computing systems.

For example, the gestures square and circle could represent actions like cut or copy on a set of selected items. On the other hand, the X could mean deleting a set of items and the flip a change of the context of an application (e.g., the background).

An analysis of the computational and memory cost as a function of the number of gestures that have to be detected is presented in the following sections.

During our tests the accelerometer on the SMCube was sampled at 100 Hz. Raw data were sent via Bluetooth to a base PC. This enables the use of this data set as a reference data set, and the later assessment of the effect of various types of data representation and processing through simulations.

Manual data segmentation and labeling was done by a test supervisor through a simple acquisition application running on the PC. This allows us to obtain reliable ground truth, and to separate the problem of gesture classification from that of data segmentation.

7. Tests and simulations

Our objective is to understand how our implementation, which uses fixed point data and the proposed scaling technique, performs with respect to a reference implementation that follows the standard algorithm and uses double precision for data representation. Therefore, we used the collected dataset to train a set of HMMs for each tester using floating point notation with double precision. Each model has been trained using 15 reference instances, 15 loops of the Baum–Welch training algorithm, and 10 initial random models. We used twofold cross-validation to exploit all the available instances for validation. Thus, the instances have been divided into two groups, and two models have been trained, each one using a different group of instances. The models have been validated on the group of instances not used for training. As a consequence, we can draw our results on the whole dataset. From now on we refer to these models as floating point models and we use them in all tests to provide reference results.

Table 3
List of activities to recognize.

Gesture circle. The tester picks up the cube from the table, draws a single, clockwise circle on the vertical plane in front of him, then puts the cube down. No constraints have been posed on the circle size.

Gesture square. The tester picks up the cube from the table, draws a single, clockwise square on the vertical plane in front of him, then puts the cube down. No constraints have been posed on the square size or the side length.

Gesture X. The tester picks up the cube from the table, draws an X on the vertical plane in front of him, then puts the cube down. The X is executed starting from the line beginning on the upper-right corner to the lower-left corner.

Gesture flip. The tester picks up the cube from the table, simulates the flip of a book page ideally placed in front of him, then puts the cube down.

The same models have been converted into fixed point notation using different accuracies (8, 16, and 32 bits), and the accuracy of these models has been compared to that of the floating point models.

Performance is evaluated using the following indexes:

- Correct classification ratio: $CCR = \frac{\text{number of correctly classified instances}}{\text{total number of instances}}$; a global indication of the performance of the classifier.

- Precision: $PR_i = \frac{\text{number of instances correctly classified for class } i}{\text{number of instances classified as class } i}$; an indication of the exactness of the classifier.

- Recall: $RC_i = \frac{\text{number of instances correctly classified for class } i}{\text{total number of instances from class } i}$; an indication of the performance of the classifier over a specific class.

7.1. Parameter selection

Before evaluating our implementation, we selected the threshold for ternarization ($\Delta R$) and the number of states for the HMMs ($N$).

In order to select these parameters, for each user, we built several models, sweeping the threshold value from $\Delta R = 100$ mg to $\Delta R = 700$ mg in 50 mg steps and the number of HMM states from 3 to 10. We evaluated the performance of each model in classifying the instances from the same user. The threshold/number-of-states pair that resulted in the best average CCR among the four testers was chosen.

Table 4 shows the results of our simulations. According to these results we adopted $\Delta R = 250$ mg and $N = 10$.


Table 7
Memory cost.

Variable size (bits)    Memory cost (bytes)
                        N = 10      N = 7
8                       560         308
16                      1120        616
32                      2240        1232


7.2. Data preprocessing cost

The preprocessing of the data requires the calculation of the magnitude and the ternarization of the resulting value. Since the square root is a monotone function, we can skip this step in the calculation of the magnitude and adapt the thresholds $R \pm \Delta R$ to the squared values. Therefore, the whole preprocessing requires only three multiplications and two sums for the magnitude and two compares for the ternarization. According to the metrics proposed in Table 1, the computational cost of this step is presented in Table 5. In the same table, we also present the number of CPU cycles and the execution time needed to preprocess an input acceleration triplet.

7.3. Forward algorithm computational and memory cost

We evaluated the time needed for the SMCube microcontroller to execute one step of the forward algorithm. Table 6 presents the number of CPU cycles and the time needed to perform such a step at 8 MHz.

The results presented in Table 6 refer to a single HMM. Here we are classifying four gestures; as a consequence, when sampling the data at 100 Hz with the MCU running at 8 MHz we cannot run the 16 and 32 bit versions of the algorithm in real time. This is not a problem, because the ATmega168 clock can be increased up to 20 MHz. At this speed the 16 bit version can be executed in real time (see Table 6). Furthermore, we can reduce the sampling rate: with a sampling rate of 20 Hz we are able to implement the 32 bit version.
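To make the real-time budget explicit: at 100 Hz a new sample arrives every 10 ms. With four models, the 16 bit version costs about 4 × 4.661 ms ≈ 18.6 ms per sample at 8 MHz, which misses the deadline, but only 4 × 1.864 ms ≈ 7.5 ms at 20 MHz, which fits. The 32 bit version costs 4 × 12.01 ms ≈ 48 ms even at 20 MHz, hence the need to lower the sampling rate to roughly 20 Hz (a 50 ms period).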

The memory cost for the three implementations is presented in Table 7.

Table 4
Average CCR for different combinations of threshold and number of states (best combination: $\Delta R$ = 250 mg with 10 states).

Th. (mg)   Number of states
           3       4       5       6       7       8       9       10
100        0.600   0.556   0.566   0.566   0.615   0.574   0.576   0.589
150        0.606   0.620   0.616   0.631   0.627   0.639   0.636   0.653
200        0.629   0.689   0.636   0.677   0.678   0.695   0.668   0.691
250        0.629   0.707   0.694   0.723   0.728   0.730   0.744   0.767
300        0.684   0.717   0.713   0.734   0.752   0.744   0.749   0.747
350        0.591   0.698   0.746   0.731   0.748   0.755   0.741   0.752
400        0.614   0.689   0.711   0.697   0.711   0.713   0.723   0.717
450        0.541   0.585   0.686   0.670   0.683   0.702   0.714   0.721
500        0.541   0.641   0.708   0.683   0.730   0.703   0.714   0.721
550        0.530   0.641   0.670   0.689   0.690   0.700   0.692   0.693
600        0.555   0.605   0.635   0.660   0.678   0.667   0.703   0.682
650        0.565   0.584   0.594   0.650   0.622   0.650   0.649   0.653
700        0.568   0.583   0.625   0.658   0.665   0.642   0.674   0.676

Table 5
Preprocessing complexity.

Data size               Cost    CPU cycles    Time (µs)
Preprocessing (8 bit)   16      64            8
Preprocessing (16 bit)  55      704           88
Preprocessing (32 bit)  123     3120          390

Table 6
Execution time of a single step of the forward algorithm with N = 10. This time refers to the evaluation of the $\alpha_{t+1}$ and the normalization procedure for a single model.

Algorithm             CPU cycles    T (ms) (8 MHz)    T (ms) (20 MHz)
Single step (8 bit)   2088          0.261             0.104
Single step (16 bit)  37,288        4.661             1.864
Single step (32 bit)  240,240       30.03             12.01

On the SMCube we can store the models either in the flash memory (as constants) or in the EEPROM, or load them at startup and keep them in RAM.

The ATmega168 has enough flash memory to store all versions of the HMMs with N = 10.

If we want to change the models dynamically, the whole code stored in flash must be updated using the bootloader. This operation takes approximately 2 min, so it may not be suitable for certain applications. To speed up the update process, the models can be stored in the EEPROM or loaded into RAM through the wireless channel at startup. If we use the HMMs with 10 states, the 8 bit models can be stored in RAM.

However, if we look at Table 4 we see that HMMs with 7–9 states present recognition ratios comparable to the 10-state one. Therefore, we can reduce the number of states N while still achieving a sufficient recognition rate. For example, the use of HMMs with N = 7 (see Table 7) allows us to load the 8 bit models into the EEPROM and both the 8 and 16 bit models into RAM. Note that reducing the number of states also reduces the computational cost (which decreases with the square of N, see Table 2), thus relaxing the constraints on the CPU clock or on the sampling rate.

Note that both the computational and memory costs increase linearly with the number of gestures to be recognized and with the square of the number of states (see Section 5.1). Therefore, when considering applications that use more than four gestures, we can reduce the number of states of the HMMs to fulfill the memory and computational constraints placed by the selected hardware.

7.4. Single user activity recognition

To assess the influence of using fixed point notation, we evaluated the PR, RC, and CCR using a set of gestures from the same performer whose gestures were used to train the HMMs.

Table 8 presents, as an example, the indexes for user 1. As can be seen from this table, the 16 and 32 bit implementations show performance comparable to that of the floating point implementation, while using only 8 bits for data representation results in a more than 20% drop in CCR. Fig. 3 compares the CCR of the fixed point implementations with that of the floating point one.

Table 8
Classification performance for user 1.

(a) Precision
Class            PR 8b    PR 16b    PR 32b    PR fl
Gesture circle   0.612    0.759     0.759     0.757
Gesture square   0.603    0.789     0.789     0.836
Gesture X        0.446    0.804     0.804     0.884
Gesture flip     0.762    0.875     0.875     0.917

(b) Recall and correct classification ratio
Class            RC 8b    RC 16b    RC 32b    RC fl
Gesture circle   0.500    0.683     0.683     0.883
Gesture square   0.633    0.933     0.933     0.933
Gesture X        0.483    0.683     0.683     0.633
Gesture flip     0.800    0.933     0.933     0.917

CCR              0.604    0.808     0.808     0.842

Fig. 3. Comparison of the correct classification ratio of the fixed point implementations and the floating point one (dashed line) when different variable sizes are used.

Table 10
Multiuser scenario classification performance (CCR).

(a) Floating point implementation
Training set    Validation set
                Usr. 1    Usr. 2    Usr. 3    Usr. 4
Usr. 1          0.842     0.496     0.475     0.388
Usr. 2          0.408     0.858     0.375     0.221
Usr. 3          0.438     0.308     0.704     0.358
Usr. 4          0.333     0.317     0.254     0.663

(b) 8-bit fixed point implementation
                Usr. 1    Usr. 2    Usr. 3    Usr. 4
Usr. 1          0.604     0.446     0.375     0.321
Usr. 2          0.446     0.825     0.354     0.208
Usr. 3          0.379     0.292     0.596     0.375
Usr. 4          0.333     0.321     0.263     0.642

(c) 16-bit fixed point implementation
                Usr. 1    Usr. 2    Usr. 3    Usr. 4
Usr. 1          0.808     0.504     0.438     0.329
Usr. 2          0.396     0.804     0.358     0.192
Usr. 3          0.425     0.338     0.683     0.279
Usr. 4          0.329     0.235     0.254     0.604

(d) 32-bit fixed point implementation
                Usr. 1    Usr. 2    Usr. 3    Usr. 4
Usr. 1          0.808     0.504     0.438     0.379
Usr. 2          0.391     0.800     0.358     0.208
Usr. 3          0.425     0.338     0.683     0.346
Usr. 4          0.329     0.325     0.254     0.604


From this figure it is clear that the 16 and 32 bit solutions show accuracies comparable with the floating point one.

According to Table 8, we can use 16 bits for our data representation with minimal loss in recognition accuracy, while decreasing the memory cost by 50% and the computational cost by 84% with respect to the 32 bit version (see Tables 6 and 7).

For some users we noticed that the performance of the 8 bit classifier was higher than that of the other implementations. This is the case for user 2, as shown in Table 9.

This behavior is tied to the fact that HMMs are a representative model of the gestures based on the training set. Furthermore, the Baum–Welch training algorithm does not guarantee that we find the global maximum of the likelihood, but only a local one. Therefore, the error introduced by the imperfect data representation may affect the likelihood evaluation in a way that increases the recognition performance on the validation set.

Table 11
Confusion matrices in the multiuser scenario – performance when tester 2 uses the cube trained by tester 1.

Gesture    Classified as
           Circle    Square    X     Flip

(a) Floating point (CCR = 0.496)
Circle     49        2         8     1
Square     3         50        7     0
X          57        1         1     1
Flip       38        0         3     19

(b) Fixed point 8 bit (CCR = 0.446)
Circle     26        9         23    2
Square     0         37        23    0
X          38        1         19    2
Flip       30        0         5     25

(c) Fixed point 16 bit (CCR = 0.504)
Circle     44        6         8     2
Square     3         48        9     0
X          46        6         7     1
Flip       32        0         6     22

(d) Fixed point 32 bit (CCR = 0.504)
Circle     45        6         8     1
Square     3         47        10    0
X          46        6         7     1
Flip       32        0         6     22

7.5. Multiple user activity recognition

In addition to the previous tests, the performance of the algorithm in a multiuser scenario, where a single cube is shared among different users, has been evaluated. Table 10a–d presents the CCR when the models trained on one user are validated on the other users, for the different implementations.

These tables show that the selected gesture recognition algorithm presents poor results in a multiuser scenario. However, this behavior is not related to the data representation but only to the classifier used in this study.

Table 9
Classification performance for user 2.

(a) Precision
Class            PR 8b    PR 16b    PR 32b    PR fl
Gesture circle   0.714    0.643     0.634     0.763
Gesture square   0.952    1.000     1.000     1.000
Gesture X        0.956    0.953     0.952     0.947
Gesture flip     0.725    0.710     0.710     0.742

(b) Recall and correct classification ratio
Class            RC 8b    RC 16b    RC 32b    RC fl
Gesture circle   0.750    0.750     0.750     0.750
Gesture square   1.000    0.967     0.967     0.967
Gesture X        0.717    0.683     0.667     0.900
Gesture flip     0.833    0.817     0.817     0.817

CCR              0.825    0.804     0.800     0.858

Therefore, if a variant of the standard HMM model shows better performance, this gain will be preserved in the fixed point implementation.

We believe that this is related to the choice of features. Since the classifier uses the magnitude of the acceleration, it does not have information regarding the direction. This can make gestures from different users look alike, depending on how each user performs the movement (more or less sharp movements). Another choice of features may improve the system performance.

This can be seen in Table 11a–d, which reports the confusion matrices for the case where performer 2 uses the cube trained by performer 1.

As can be seen from these tables, the classifier tends to confuse the gestures X and flip with circle, while the gesture square is better recognized. Still, the confusion matrices for the 16 and 32 bit fixed point implementations show better results than the floating point one.



As in the single user scenario, this behavior is a consequence of the fact that the Baum–Welch training algorithm does not produce the best possible models, and the error introduced by the fixed point data representation may increase the CCR.

8. Conclusions

The popularity of TUIs, physical objects used for human–computer interaction, is growing together with the development of VR applications and smart spaces. The effectiveness of TUIs can be enhanced by the use of smart objects able to sense their status and the gestures that the user performs with them. On-board gesture recognition will therefore play a central role in the development of new TUIs, improving object battery lifetime, system scalability and the handling of moving TUIs.

In this paper, we presented and characterized an implementation of the HMM forward algorithm suitable for the class of low-power, low-cost MCUs typically embedded into TUIs. HMMs are state-of-the-art algorithms for gesture and speech recognition.

The proposed solution can be implemented on the SMCube, a tangible interface developed as a building block of the TANGerINE framework. The characterization of our algorithm in both single and multiuser scenarios demonstrates that the use of fixed point data representation results in recognition ratios comparable to the floating point case when using 16 or more bits.

We evaluated the computational and memory cost of implementing a solution able to recognize four gestures with a set of 10-state HMMs. We show that the flash memory available on the SMCube is enough to store all versions of the model, and that if fast recognition capabilities are needed we can use HMMs with a lower number of states without excessive recognition loss. By increasing the CPU clock to 20 MHz, the 16 bit version of the algorithm can be executed in real time at a 100 Hz sampling rate. By decreasing the sampling rate we are also able to implement the 32 bit version and thus achieve better accuracy.

Even if the classification algorithm in some cases does not achieve extremely high accuracy, this behavior is not related to the fixed point implementation but to the chosen algorithm itself. Thus, if a different version of HMM shows a better classification ratio, it can be implemented using at least 16-bit fixed point operands on the SMCube with minimal performance loss.

Acknowledgments

Part of this work has been supported by the TANGerINE project (www.tangerineproject.org), by the ARTISTDESIGN project funded under EU FP7 (Project Reference: 214373) (www.artist-embedded.org/artist/) and by the SOFIA project funded under the European Artemis programme SP3 Smart environments and scalable digital services (Grant agreement: 100017) (www.sofia-project.eu).

References

[1] Atmel, Atmel products, 2009. Available from: <http://www.atmel.com/products/>.

[2] V. Baier, L. Mosenlechner, M. Kranz, Gesture classification with hierarchically structured recurrent self-organizing maps, in: Networked Sensing Systems, INSS'07, Fourth International Conference, 2007, pp. 81–84.

[3] G. Bailador, D. Roggen, G. Tröster, G. Triviño, Real time gesture recognition using continuous time recurrent neural networks, in: BodyNets'07: Proceedings of the ICST Second International Conference on Body Area Networks, ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels, Belgium, 2007, pp. 1–8.

[4] S. Baraldi, A. Del Bimbo, L. Landucci, A. Valli, wikiTable: finger driven interaction for collaborative knowledge-building workspaces, in: CVPRW'06: Proceedings of the 2006 Conference on Computer Vision and Pattern Recognition Workshop, IEEE Computer Society, Washington, DC, USA, 2006, p. 144.

[5] S. Baraldi, A. Del Bimbo, L. Landucci, N. Torpei, O. Cafini, E. Farella, A. Pieracci, L. Benini, Introducing TANGerINE: a tangible interactive natural environment, in: Proceedings of the ACM International Conference on Multimedia (MM), ACM Press, Augsburg, Germany, 2007, pp. 831–834.

[6] Bluegiga, Bluegiga Bluetooth modules, 2009. Available from: <http://www.bluegiga.com/Bluetooth_Modules/>.

[7] M. Bruns Alonso, V. Keyson, MusicCube: a physical experience with digital music, in: Personal Ubiquitous Computing, vol. 10, Springer-Verlag, London, UK, 2006, pp. 163–165.

[8] O. Cafini, P. Zappi, E. Farella, L. Benini, S. Baraldi, N. Torpei, L. Landucci, A. Del Bimbo, TANGerINE SMCube: a smart device for human computer interaction, in: Proceedings of the IEEE European Conference on Smart Sensing and Context, IEEE Computer Society, 2008.

[9] K. Camarata, E.Y.-L. Do, M.D. Gross, B.R. Johnson, Navigational blocks: tangible navigation of digital information, in: CHI'02 Extended Abstracts on Human Factors in Computing Systems, ACM, 2002, pp. 752–753.

[10] C. Chen, J. Liang, H. Zhao, H. Hu, J. Tian, Factorial HMM and parallel HMM for gait recognition, in: IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 39, 2009, pp. 114–123.

[11] E. Farella, A. Pieracci, A. Acquaviva, Design and implementation of WiMoCA node for a body area wireless sensor network, in: Systems Communications, Proceedings, 2005, pp. 342–347.

[12] F.G. Hofmann, P. Heyer, G. Hommel, Velocity profile based recognition of dynamic gestures with discrete hidden Markov models, in: Proceedings of the International Gesture Workshop on Gesture and Sign Language in Human–Computer Interaction, Springer-Verlag, London, UK, 1998, pp. 81–95.

[13] C.-R. Huang, C.-S. Chen, P.-C. Chung, Tangible photorealistic virtual museum, in: IEEE Computer Graphics and Applications, vol. 25, IEEE Computer Society Press, Los Alamitos, CA, USA, 2005, pp. 15–17.

[14] H. Ishii, The tangible user interface and its evolution, in: Communications of the ACM, vol. 51, ACM, New York, NY, USA, 2008, pp. 32–36.

[15] H. Ishii, B. Ullmer, Tangible bits: towards seamless interfaces between people, bits and atoms, in: CHI'97: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ACM, New York, NY, USA, 1997, pp. 234–241.

[16] K. Jeong, J. Won, C. Bae, User activity recognition and logging in distributed intelligent gadgets, in: Multisensor Fusion and Integration for Intelligent Systems, MFI 2008, IEEE International Conference, 2008, pp. 683–686.

[17] S. Jordà, M. Kaltenbrunner, G. Geiger, R. Bencina, The reacTable, in: Proceedings of the International Computer Music Conference (ICMC 2005), Barcelona, Spain, 2005.

[18] M. Kaltenbrunner, R. Bencina, reacTIVision: a computer-vision framework for table-based tangible interaction, in: TEI'07: Proceedings of the First International Conference on Tangible and Embedded Interaction, ACM, 2007, pp. 69–74.

[19] E. Keogh, S. Chu, D. Hart, M. Pazzani, An online algorithm for segmenting time series, in: Proceedings of the IEEE International Conference on Data Mining, 2001, pp. 289–296.

[20] L. Kim, H. Cho, S.H. Park, M. Han, A tangible user interface with multimodal feedback, in: Twelfth International Conference on Human–Computer Interaction (HCI), 2007, pp. 94–103.

[21] M. Ko, G. West, S. Venkatesh, M. Kumar, Online context recognition in multisensor systems using dynamic time warping, in: Proceedings of the Conference on Intelligent Sensors, Sensor Networks and Information Processing, 2005, pp. 283–288.

[22] S. Mann, "Smart clothing": wearable multimedia computing and "personal imaging" to restore the technological balance between people and their environments, in: MULTIMEDIA'96: Proceedings of the Fourth ACM International Conference on Multimedia, ACM, New York, NY, USA, 1996, pp. 163–174.

[23] M. Kranz, D. Schmidt, P. Holleis, A. Schmidt, A display cube as tangible user interface, in: Adjunct Proceedings of the Seventh International Conference on Ubiquitous Computing (Demo 22), 2005.

[24] A. Mazalek, M. Reynolds, G. Davenport, TViews: an extensible architecture for multiuser digital media tables, in: IEEE Computer Graphics and Applications, vol. 26, 2006, pp. 47–55.

[25] M.A. Mendoza, N. Pérez De La Blanca, Applying space state models in human action recognition: a comparative study, in: AMDO'08: Proceedings of the Fifth International Conference on Articulated Motion and Deformable Objects, Springer-Verlag, Berlin, Heidelberg, 2008, pp. 53–62.

[26] Microsoft Corporation, Microsoft Surface, 2009. Available from: <http://www.microsoft.com/SURFACE/index.html/>.

[27] S. Moon, J.-N. Hwang, Robust speech recognition based on joint model and feature space optimization of hidden Markov models, in: IEEE Transactions on Neural Networks, vol. 8, March 1997, pp. 194–204.

[28] Nintendo, Wii homepage, 2008. Available from: <http://wii.com/>.

[29] S. Oh, W. Woo, Manipulating multimedia contents with tangible media control system, in: Third International Conference on Entertainment Computing (ICEC), 2004, pp. 57–67.

[30] C. O'Malley, D. Stanton Fraser, Literature Review in Learning with Tangible Technologies, Technical Report, Learning Sciences Research Institute, University of Nottingham, Department of Psychology, University of Bath, 2004.

[31] K.S. Park, H.S. Cho, J. Lim, Y. Cho, S.-M. Kang, S. Park, Learning cooperation in a tangible Moyangsung, in: R. Shumaker (Ed.), HCI (14), Lecture Notes in Computer Science, vol. 4563, Springer, 2007, pp. 689–698.

[32] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, in: Proceedings of the IEEE, vol. 77, 1989, pp. 257–285.

[33] M. Rauterberg, T. Mauch, R. Stebler, The digital playing desk: a case study for augmented reality, in: Robot and Human Communication, Fifth IEEE International Workshop, 1996, pp. 410–415.

[34] D. Roggen, N. Bharatula, M. Stäger, P. Lukowicz, G. Tröster, From sensors to miniature networked sensor buttons, in: Proceedings of the Third International Conference on Networked Sensing Systems, INSS 2006, 2006, pp. 119–122.

[35] A. Smailagic, D.P. Siewiorek, U. Maurer, A. Rowe, K.P. Tang, eWatch: context sensitive system design case study, in: ISVLSI'05: Proceedings of the IEEE Computer Society Annual Symposium on VLSI: New Frontiers in VLSI Design, IEEE Computer Society, Washington, DC, USA, 2005, pp. 98–103.

[36] Sony, EyeToy homepage, 2003. Available from: <http://www.eyetoy.com/index.asp/>.

[37] T. Stiefmeier, G. Ogris, H. Junker, P. Lukowicz, G. Tröster, Combining motion sensors and ultrasonic hands tracking for continuous activity recognition in a maintenance scenario, in: Tenth IEEE International Symposium on Wearable Computers, 2006.

[38] T. Stiefmeier, D. Roggen, G. Ogris, P. Lukowicz, G. Tröster, Wearable activity tracking in car manufacturing, in: IEEE Pervasive Computing, vol. 7, 2008, pp. 42–50.

[39] STM, STM tri-axial accelerometer, 2009. Available from: <http://www.st.com/stonline/products/literature/ds/11115.htm/>.

[40] M. Tahir, G. Bailly, E. Lecolinet, ARemote: a tangible interface for selecting TV channels, in: Artificial Reality and Telexistence, Seventeenth International Conference, 2007, pp. 298–299.

[41] K. Ueda, A. Kosaka, R. Watanabe, Y. Takeuchi, T. Onoye, Y. Itoh, Y. Kitamura, F. Kishino, m-ActiveCube: multimedia extension of spatial tangible user interface, in: Proceedings of the Second International Workshop on Biologically Inspired Approaches to Advanced Information Technology (BioADIT), 2006, pp. 363–370.

[42] J. Verhaegh, W. Fontijn, A. Jacobs, On the benefits of tangible interfaces for educational games, in: Digital Games and Intelligent Toys Based Education, Second IEEE International Conference, 2008, pp. 141–145.

[43] R. Watanabe, Y. Itoh, Y. Kitamura, F. Kishino, H. Kikuchi, Distributed autonomous interface using ActiveCube for interactive multimedia contents, in: ICAT'05: Proceedings of the 2005 International Conference on Augmented Tele-existence, ACM, New York, NY, USA, pp. 22–29.

[44] M. Weiser, The Computer for the 21st Century, Scientific American, vol. 265, 1991, pp. 66–75.

[45] P. Zappi, T. Stiefmeier, E. Farella, D. Roggen, L. Benini, G. Tröster, Activity recognition from on-body sensors by classifier fusion: sensor scalability and robustness, in: Intelligent Sensors, Sensor Networks and Information, ISSNIP 2007, Third International Conference, 2007, pp. 281–286.

[46] L. Zeller, L. Scherffig, CubeBrowser: a cognitive adapter to explore media databases, in: CHI EA'09: Proceedings of the 27th International Conference Extended Abstracts on Human Factors in Computing Systems, ACM, 2009, pp. 2619–2622.

[47] S. Zhong, J. Ghosh, HMMs and coupled HMMs for multi-channel EEG classification, in: Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN'02), vol. 2, 2002, pp. 1154–1159.