
Well-balanced Usability & Annotation Complexity in Interactive Video Semantization

Yiwei Cao1, Dominik Renzel1, Matthias Jarke1, Ralf Klamma1, Michael Lottko1 Georgios Toubekis2, and Michael Jansen2

1Information Systems & Database Technology, RWTH Aachen University Ahornstr. 55, 52056 Aachen Germany

{cao|renzel|jarke|klamma|lottko}@dbis.rwth-aachen.de 2Chair of Urban History, RWTH Aachen University, Germany

Templergraben 49, 52056 Aachen Germany {toubekis|jansen}@sbg.rwth-aachen.de

Abstract—In the recent years more and more people have begun to edit video files, as technologies are becoming more sophisticated and affordable. Web 2.0 has raised tagging functionality to a growing number of websites such as Flickr and YouTube. However, these services only provide basic video annotation support. In comparison to those well-known services there are many research efforts towards video semantization tools. These tools provide highly precise annotation functionality based on metadata standards such as MPEG-7, but tend to exhibit very complex user interfaces. In this paper we present the design, implementation and evaluation of SeViAnno, an MPEG-7 based interactive semantic video annotation Web platform with the main objective to find a well-balanced trade-off between a simple user interface and video semantization complexity.

Keywords-Web 2.0, multimedia, MPEG-7, interactivity, multimedia semantization, web services, RIA, usability, mashup

I. INTRODUCTION The amount of user-generated online videos is growing rapidly, and with it the need to annotate videos. The high interest in this functionality is observable in popular web services such as YouTube or Flickr, which support adding tags and other metadata to videos and images. Both systems are very easy to handle and exhibit a high degree of usability. In contrast, applications developed on the basis of comparably heavy-weight metadata standards such as MPEG-7 [1] are very precise, but often hard to use.

Furthermore, semantic annotations make certain information easier to find: videos need to be searchable based on semantic information, and useful pieces of additional information are place, time, or persons. In addition, the community aspect plays an important role in the interactive multimedia semantization process.

In addition, research fields such as architecture or archaeology depend on the so-called Cult of the Amateur [2]. The rapid development of RIA technologies enables the creation of web applications which are accessible worldwide with the same look and feel as desktop applications. As the Cult of the Amateur is very important for architecture, this work was developed together with specialists in cultural heritage management.

Within the context of this work, we encountered manifold problems with current video semantization systems. Most of them are developed for generic use rather than for a special application domain. Another problem related to the aforementioned generality is the complexity of terminology. Different communities cultivate terminologies on different levels of complexity. In that sense, semi-professionals do not share the same terminology as professionals and should thus be supported, for instance, by a thesaurus [3]. Furthermore, the motivation to collaboratively annotate multimedia has to be increased following the concept of Communities of Practice [4]. Another problem of many applications is their poor usability. Shacklett showed that 28% of e-commerce web transactions result in user errors and subsequent frustration. 6% of the visitors leave the website because of errors, stating that they did not want to use it anymore [5]. One of the biggest problems is accuracy, as popular websites only provide basic annotation support such as keyword tagging [6] or simple text descriptions. Furthermore, users do not want to deal with video codecs, although there are many different ones available.

In this paper we present the design, the prototypic implementation and an evaluation of SeViAnno, an MPEG-7 based video annotation tool as an approach to overcoming the aforementioned problems with a main focus on finding an appropriate trade-off between user interface simplicity and semantic video annotation precision.

This paper is structured as follows. Section II provides an overview of related concepts and technologies. An analysis of five video semantization tools is presented with their strengths and weaknesses. Section III is dedicated to the design process of SeViAnno as part of the Virtual Campfire [7] scenario in collaboration with the Bamiyan Valley Development Community1. The prototype is a proof of concept for the collaborative production and sharing of multimedia and metadata using Web 2.0 [8] multimedia community tools in cultural heritage. Section IV presents the technical details of the implementation, including information on the whole architecture and the interplay between its parts. We provide a brief evaluation of the prototype in Section V and conclude the paper with an outlook on further work in Section VI.

1 http://www.bamiyan-development.org/

978-1-4244-7566-7/10/$26.00 ©2010 IEEE

II. RELATED WORK In this section we describe the most important concepts and the state of the art of key technologies for SeViAnno, including the Semantic Web, Rich Internet Applications, metadata, and current video annotation tools.

A. Concepts and Terminologies The term semantization refers to describing the meaning of text, images, and videos. The main goal of the Semantic Web is to enable machine readability and automatic processing of the meaning of multimedia [9].

A substantial part of the Semantic Web is constituted by the usage of metadata to describe the content of multimedia. Metadata can be categorized into descriptive metadata (e.g. title, topic, etc.), user metadata (e.g. author, rights, etc.) and technical metadata (e.g. date, format, etc.). Many applications already utilize such information, e.g. in surveillance, television broadcasting, or learning systems. This led to the development of different standards, of which MPEG-7 is the most complete [1]. Its first version was released in 2001 and has evolved continuously since then. It can be applied to both real-time and non-real-time systems, and is not limited to any field of application or file format. MPEG-7 consists of twelve parts, of which part five (Multimedia Description Schemes) is the most important for our work. This part defines a set of description tools which deal with multimedia entities. These tools can be grouped into five classes: content description, content management, content organization, navigation & access, and user interaction.
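To illustrate what such descriptive metadata looks like in practice, the following is a simplified MPEG-7 description fragment for a video with a title and a free-text annotation. The element nesting is abridged and the content values are invented for illustration; a schema-valid description requires additional structure.

```xml
<Mpeg7 xmlns="urn:mpeg:mpeg7:schema:2001"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Description xsi:type="ContentEntityType">
    <MultimediaContent xsi:type="VideoType">
      <Video>
        <!-- descriptive metadata: title of the content -->
        <CreationInformation>
          <Creation>
            <Title>Example video title</Title>
          </Creation>
        </CreationInformation>
        <!-- free-text annotation of the whole video -->
        <TextAnnotation>
          <FreeTextAnnotation>Example free-text description</FreeTextAnnotation>
        </TextAnnotation>
      </Video>
    </MultimediaContent>
  </Description>
</Mpeg7>
```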

The MPEG-7 standard can be integrated into a system with relative ease [10]. However, MPEG-7 is also a very complex standard, and most parts are often not needed. Therefore, it may be better to use the MPEG-7 profiles defined in Part 9 of the standard, or MPEG-4 LASeR for mobile purposes [11]. Another standard is the DCMI Metadata Terms by the Dublin Core Metadata Initiative [12]. A subset of this standard is the Dublin Core Metadata Element Set, which consists of 15 elements and is mainly used by libraries and museums [13].

Rich Internet Applications (RIA) aim to provide intuitive user interfaces and complex applications which are platform independent and widely accessible. The term was coined by Macromedia in 2001, which described such an application as appealing, interactive, slim and flexible [14]. In 2004, Macromedia introduced the first version of Flex, which was not very successful due to high pricing and the lack of an IDE. RIA became much more popular with the introduction of Flex 2 in 2006. Currently, Adobe has released Flex 4 [15]. Other technologies to create RIAs are Ajax [16] or Microsoft Silverlight [17].

A Mashup is a website or application which combines data and functionality of different services to create a new service, e.g. visualize restaurant locations using Google Maps [18].

Interactivity can be described as a communication concept, and a detailed definition can be given in different ways. The socially based concept of interaction regards as interactive every medium that follows the concept of conversation (e.g. telephone, TV, computer, etc.). A different definition treats interactivity as a criterion: in this case, interactivity describes the active involvement of the user in controlling the behavior of technology. This concept should lead to individualized usage of technologies [19].

Community of Practice describes a concept of social learning. The members are persons who share the same field of interest or work together. Their aims, tasks and ways of communication are set up by themselves. In the next phase, knowledge is established and shared among the members. The roles of the different members are also defined by themselves. In the end, a Community of Practice should lead to more efficient problem solving, although it is possible that the opposite takes place [4].

B. Related Systems We especially analyzed the main differences between intuitive and precise tools. First, we describe two intuitive tools, and then three complex tools.

YouTube provides the possibility to add text annotations to video files. This function can be activated by users uploading videos. It is often used to add comments or hyperlinks, or even abused for advertisements. Annotations can easily be added while watching the video with no more than two clicks. Nevertheless, annotation complexity is very limited (only free-text annotation), and YouTube does not support video search and retrieval by annotations.

VideoANT was developed by the University of Minnesota and can be used for text annotations. Like YouTube, it is a web-based service, which allows annotating flash videos (cf. Figure 1). Adding an annotation requires only one click and is convenient to use. All annotations are listed on the right side of the video and can be accessed all the time [20].

Figure 1. Screenshot of VideoANT

Semantic Video Annotation Suite was developed at the Joanneum Research Institute and is a desktop-based application. It fully supports the MPEG-7 standard and provides automatic annotation support such as object recognition and object tracking. In comparison to the aforementioned tools, the user interface is more complex and entails a steeper learning curve [21].

IBM VideoAnnEx Annotation Tool provides MPEG-7 based semantic video annotation. In addition to using the MPEG-7 semantic base types, this tool allows users to create their own library of types. However, the user interface is comparably complex [22].

M-OntoMat-Annotizer developed by aceMedia allows MPEG-7 based video and image annotations (cf. Figure 2). An additional feature is the automatic region recognition, which separates every video frame into smaller elements. It has the most complex user interface of all presented tools [23].

Figure 2. Screenshot of M-OntoMat-Annotizer

Comparing intuitive and precise tools, their user interfaces differ strongly. VideoANT has only a few buttons, and users can add and delete annotations quickly. Precise tools tend to overload their user interfaces with many buttons and options. Less professional users will feel uncomfortable and overwhelmed when looking at these tools. Therefore, the user interface should be very clearly structured and hide more complex annotation functionality from normal users.

III. DESIGN OF SEVIANNO We aim to deliver a guideline for interactive semantization of multimedia using RIA in the field of cultural heritage, taking the existing problems into account. We worked together with the Center for Documentation and Conservation at RWTH Aachen to elicit functional and non-functional requirements. Finally, we present the system architecture.

As the first step of the requirements analysis, we created a paper prototype [24] (cf. Figure 3) prior to the first implementation in order to avoid major interface flaws. Results were considered for both the implementation and the features. The paper prototype consisted of printed elements and Post-its; the Post-its were used to simulate interactivity. The prototype was tested with 30 persons with different experience backgrounds in technology use. Their ages ranged from 11 to 72 years.

First, we tested whether the actions caused by clicking buttons were clearly predictable and had no surprising effects. The concept of tags was especially unclear to subjects without any computer background. The Google Maps integration was received very well, and most people could directly understand the underlying concept of having location as one type of semantic annotation.

In a second phase, we tested whether the subjects were able to solve a given task (e.g. find the annotation “coin”). Here we found that clear terminology and as few mouse clicks as possible help to increase usability and understandability. Too many submenus would frustrate users and decrease their motivation.

In a third phase, we tested different icons intended to represent different semantic tags based on MPEG-7 semantic base types. We provided the users with printed versions of the icons, which they had to sort on a sorting field. This allowed us to evaluate the clarity of the icons. The most difficult icon to recognize was the tag icon.

Figure 3. Paper Prototype for SeViAnno

All test results were documented on a predefined form to simplify the evaluation. The last part of the form included space for generic comments. It was very often suggested to include a timeline for the time-specific annotations to visualize these base types in a better way, as is already done with the place base types on a map. Although paper prototyping is a time-consuming process, it proved excellent for increasing usability and for avoiding major flaws before the actual implementation.

A. Functional Requirements Analysis Based on the use cases (cf. Figure 4), the analysis of current products, and the paper prototype, the SeViAnno web service should include the following features to solve the main problems mentioned before:

Annotation: It should be possible to create, edit and delete annotations. The most important part is creating annotations. Therefore, adding a description and keywords for the overall video must be supported. An even more important function is the ability to separate videos into sequences and add keywords or certain semantic types. Usual text input should be supported. Furthermore, optimized UI elements for the input of certain semantic tag types should be supported to achieve better usability. For instance, adding place information such as longitude and latitude can be supported by selecting a location on a map. Pre-entered annotations such as historic persons can be shown in a list and added by drag & drop. An editing mode can be accessed by clicking on the “edit” button. Terminology support can be included by adding a tooltip box.
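The create/edit/delete cycle described above can be pictured as a minimal data model. The sketch below uses hypothetical class and field names and is not taken from the actual SeViAnno code; it only illustrates the required operations on segment-level annotations.

```java
import java.util.ArrayList;
import java.util.List;

// One semantic annotation attached to a video segment.
class SemanticAnnotation {
    final String type;        // e.g. "Place", "Agent", "Object"
    String description;       // free-text part, editable later
    final double startSec;    // start of the annotated segment
    final double durationSec; // length of the segment

    SemanticAnnotation(String type, String description, double startSec, double durationSec) {
        this.type = type;
        this.description = description;
        this.startSec = startSec;
        this.durationSec = durationSec;
    }
}

class AnnotatedVideo {
    final List<SemanticAnnotation> annotations = new ArrayList<>();

    // Create: add a semantic tag for a video segment.
    SemanticAnnotation annotate(String type, String text, double start, double duration) {
        SemanticAnnotation a = new SemanticAnnotation(type, text, start, duration);
        annotations.add(a);
        return a;
    }

    // Edit: replace the free-text description of an existing annotation.
    void edit(SemanticAnnotation a, String newText) {
        a.description = newText;
    }

    // Delete an annotation.
    void remove(SemanticAnnotation a) {
        annotations.remove(a);
    }
}
```

In the real system these operations map to calls against the MPEG-7 services rather than an in-memory list.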

Display: Existing annotations should be displayable in different ways:

1. Overlaying annotations on the video player.

2. A list of all annotations. If a user clicks on an annotation, the video seeks to the corresponding position.

3. Place information can be indicated with marks on a map (e.g. Google Maps). A click on a mark should have the same behavior as in the list.

4. Timeline for time annotations.

User Management: This part includes creating, editing and deleting users. The most important functions are registration and login. It should be possible to differentiate users by the community they belong to. To support Communities of Practice, a rating system for annotations was considered useful: every user would get an overall rating, and users would begin to compete for better ratings.

Video Upload: The user should be able to upload video files to the service. Already uploaded video files can be annotated by every user belonging to the corresponding community. The file format should be no barrier and therefore the web service needs to be capable of server-side video encoding.

Figure 4. SeViAnno use cases for agents in a cultural heritage scenario

Additional potential features are a search function, object tracking, and object tracing.

B. Non-functional Requirements Usability: Usability [25] means that the web application can be used in an easy and efficient way; therefore, the needs of the main users should be taken into account. To find out the users’ needs, user tests can be conducted and evaluated [26]. Heuristics like the design guidelines by Nielsen [27], e.g. learnability and user satisfaction, can additionally be applied. McLaughlin and Skinner reduced usability to six basic components [28]:

1. Checkability: System checks or allows checking to ensure that correct information is going in and out.

2. Confidence: Users have confidence using the system and in themselves being able to use the system.

3. Control: Users have control over the system and especially over the data streams.

4. Ease of Use: It is very simple to use the system.

5. Speed: System can be used fast.

6. Understanding: System and especially the output are understandable.

Simplicity can be achieved by reducing the number of clicks needed to perform a task. Understandability can be increased by using clear terminology and visual help such as tooltips [27].

Portability: Platform independence is required to solve problems with different operating systems or devices used by members of the same community. This can be achieved by using Adobe Flex or other RIA technology.

Robustness: The service should be reliable and avoid a high number of failures. In particular, erroneous user actions should have no negative influence on the application.

Speed: Interaction should be fluent. Working with video files at low speed quickly becomes annoying and should therefore be avoided.

Mobility: Mobility is a very important aspect of future web application development; developers should prepare their applications for mobile use.

C. Semantization The main elements of semantic information in videos dealing with immobile cultural heritage are persons, places, time, buildings and objects. Therefore, the most important semantic base types are the following ones:

Agent describes persons by their full name and an optional email address. Object is a very generic term and should be used for daily objects, artifacts, historic objects, etc.; both the name and a description are stored. Concept, in contrast to Object, is information that is ungraspable and formed in the mind, like a thought or a notion; for example, culture or religion can be considered concepts. Event is stored using a name, date and location; its location is saved as longitude and latitude. An example of an event is a battle during WWII. To achieve unity among event names, a predefined list can be used. Place describes the location of the action or the location shown in the video; it is stored with a name, longitude and latitude. Time describes a date; it can describe the date of an action or an event and is stored with a name, duration and date.

In addition to the semantic tags, it should be possible to add simple keywords and descriptions to video segments. To begin with, the chosen base types are sufficient for use in the context of cultural heritage. Nevertheless, it is possible to construct complete type hierarchies in MPEG-7, but it is questionable whether this would be compatible with the requirement for simplicity.
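A compact way to picture the attributes stored per base type is the following sketch. The class and factory names are hypothetical; in the real system these values end up as MPEG-7 semantic base type descriptions.

```java
// One value object covering the six base types and the attributes the
// text above says are stored for each; unused attributes stay null.
class BaseType {
    final String kind, name;
    final Double lon, lat; // Place and Event store longitude/latitude
    final String date;     // Event and Time store a date

    private BaseType(String kind, String name, Double lon, Double lat, String date) {
        this.kind = kind; this.name = name;
        this.lon = lon; this.lat = lat; this.date = date;
    }

    static BaseType agent(String fullName)   { return new BaseType("Agent", fullName, null, null, null); }
    static BaseType object(String name)      { return new BaseType("Object", name, null, null, null); }
    static BaseType concept(String name)     { return new BaseType("Concept", name, null, null, null); }
    static BaseType event(String name, String date, double lon, double lat) {
        return new BaseType("Event", name, lon, lat, date);
    }
    static BaseType place(String name, double lon, double lat) {
        return new BaseType("Place", name, lon, lat, null);
    }
    static BaseType time(String name, String date) {
        return new BaseType("Time", name, null, null, date);
    }
}
```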

D. Interactivity A first idea for realizing interactivity is to show all existing annotations as a tag cloud [29]. Small icons/images should help identify the type of each annotation. It should be possible to drag an annotation from the tag cloud and drop it on the video in order to simplify the annotation process. A dialog should then appear to define the end of the sequence, reusing the semantic base type. A list of all existing annotations should be available, too. This improves the overview and makes it possible to navigate through the video using the list. The sorting of the list should be changeable. Annotations containing place information should be shown on a map, indicated by markers. Hovering over a marker shows its name, time point and duration. A click on a marker enables navigation in the video via the place annotations. To implement this functionality, the mashup concept is applicable using either Microsoft Virtual Earth or Google Maps. The map is also used to simplify adding place information.

E. Architecture The SeViAnno architecture (cf. Figure 5) is based on a classical 3-Tier Model. The three components are:

Client Tier: The client tier constitutes the SeViAnno frontend and consists of the main elements User, Video Player and Annotation. The Video Player provides the possibility to watch, change and upload videos. The User element is responsible for user management (e.g. registration, profile creation and viewing). The Annotation element accounts for the largest part of the client tier. It contains the whole interface for adding, editing and viewing annotations; its parts are Map, Timeline, View and Edit.

Business Tier: The business tier provides all services used by the client tier. Video management is mostly done by Flex components. Annotation management is realized by LAS [30] services, which are explained in more detail below. In particular, the MPEG-7 Semantic Base Type Service, MPEG-7 Multimedia Content Service, and User Manager Service are used. To allow video segmenting it is necessary to extend the MPEG-7 Multimedia Content Service with additional methods.

Resource Tier: This tier consists of elements which mainly exist in our local infrastructure. MPEG-7 data and security-relevant data are stored in XML databases. To store, encode and stream video files we use the Wowza Media Server 2 with RTMP protocol support to stream to the Flash video player.

Figure 5. SeViAnno Architecture

IV. IMPLEMENTATION In this section we describe the implementation of the SeViAnno prototype based on the previous requirements analysis. We explain in detail the software used, the data sets, and problems we encountered during development.

A. Implementation of the Client The most important part of the Flex implementation is the complete annotation functionality, which includes creation, display and editing. All annotations are shown in a list sortable by type, time point or alphabetically. Place annotations are shown on the lower right side in the integrated Google Map, where every annotation is indicated by a marker. By clicking on a marker or an element in the list, the video jumps directly to the corresponding time point. The description and keywords are displayed on the upper right side of the application (cf. Figure 6).
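The three sort orders of the annotation list can be expressed with standard `java.util` comparators. The sketch below uses hypothetical field names, not the actual SeViAnno classes.

```java
import java.util.Comparator;

// Minimal annotation record for sorting purposes.
class Ann {
    final String type, text;
    final double timeSec;
    Ann(String type, String text, double timeSec) {
        this.type = type; this.text = text; this.timeSec = timeSec;
    }
}

class AnnotationSorter {
    // Sort by semantic type, by time point in the video, or alphabetically.
    static final Comparator<Ann> BY_TYPE = Comparator.comparing((Ann a) -> a.type);
    static final Comparator<Ann> BY_TIME = Comparator.comparingDouble((Ann a) -> a.timeSec);
    static final Comparator<Ann> ALPHABETICAL = Comparator.comparing((Ann a) -> a.text);
}
```

Re-sorting the displayed list then amounts to calling `list.sort(...)` with the chosen comparator.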

The second aspect is the creation of annotations. Annotations are added by a simple click on the name of the needed base type, along with the additional input of a text description. Adding a place is done in the Google Maps part of the interface, either by entering the name of the desired location in a search field or by clicking directly on a point on the map. Longitude and latitude are provided automatically.

The client was realized using Adobe Flex/Flash 4, which has the advantage of being platform independent and prevalent on the market. All products developed in Flex are accessible via the standard Flash plug-in. Furthermore, Flex is a very sophisticated framework with many libraries; for instance, the video player already existed and only needed minor adjustments. Flash is the leading technology for online video applications. The Flash Builder 4 IDE is based on Eclipse and did not need any adaptation before use.

2 http://www.wowzamedia.com/

B. LAS MPEG-7 Services Since our Lightweight Application Server (LAS) [30] already provided a range of operations on MPEG-7 descriptions, including their persistence in a native XML database, we could concentrate mainly on client-side development of SeViAnno. Remote method invocation is realized using a LAS connector for the HTTP protocol. A set of LAS core services already realized the management of users, communities and access control, and a website for LAS registration already existed.

Figure 6. Screenshot of the SeViAnno Web interface

MPEG-7 Semantic Base Type Service: This service is designed for the management and persistence of MPEG-7 semantic base types, which can serve as semantic tags assigned to multimedia or segments thereof.

MPEG-7 Multimedia Content Service: This service is designed for annotations of complete multimedia files such as images and videos. Within the context of this work we have realized an extension of this service towards the segmentation of videos, employing the Audio Visual Segment Temporal Decomposition type of the MPEG-7 standard. The Audio Visual Segment type is now used to attach the semantic references of the base types to the corresponding video segments. To save the time point and the duration of a segment we employ the MPEG-7 MediaTime type as a description tool.
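An MPEG-7 MediaTime value combines a media time point with a media duration, both of which have compact lexical forms. The sketch below produces one simplified variant of each (whole seconds only, without the standard's optional fractional/frame part), so it approximates rather than fully implements the MPEG-7 syntax.

```java
// Format whole-second offsets as simplified MPEG-7 MediaTimePoint and
// MediaDuration strings (no fractional seconds or frame counts).
class MediaTimeFormat {
    // e.g. 65 seconds into the video -> "T00:01:05"
    static String timePoint(int totalSec) {
        return String.format("T%02d:%02d:%02d",
                totalSec / 3600, (totalSec % 3600) / 60, totalSec % 60);
    }

    // e.g. a 90-second segment -> "PT0H1M30S"
    static String duration(int totalSec) {
        return String.format("PT%dH%dM%dS",
                totalSec / 3600, (totalSec % 3600) / 60, totalSec % 60);
    }
}
```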

The extension of the service has been realized using XMLBeans, which is used in most of the LAS services. As an IDE we used the Eclipse IDE for Java Developers 3.5. In summary, the LAS framework helps overcome the impreciseness of intuitive annotation tools.

C. Video To stream video files to the Flash application, it is necessary to convert a file if it is not encoded with the H.264 codec. Conversion is done with FFMPEG [31], free software available on most platforms to record and convert videos. In addition, it supports all important video codecs.
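The server-side conversion step amounts to invoking FFMPEG with an H.264 encoder. A sketch of assembling that command line is shown below; the file names are hypothetical, and `-vcodec libx264` is the standard FFMPEG option for selecting the x264 encoder.

```java
import java.util.Arrays;
import java.util.List;

class TranscodeCommand {
    // Build the FFMPEG invocation that re-encodes an input file to H.264,
    // e.g. for handing to a java.lang.ProcessBuilder on the server.
    static List<String> toH264(String input, String output) {
        return Arrays.asList("ffmpeg", "-i", input, "-vcodec", "libx264", output);
    }
}
```

Keeping the command as an argument list (rather than one shell string) avoids quoting problems with file names containing spaces.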

The converted file is then copied to the Wowza Media Server 2, where it is streamed using the Adobe RTMP protocol. Wowza Media Server 2 allows us to use the same video files for different services, as it supports many streaming protocols, e.g. Adobe RTMP, Apple HTTP Streaming, Microsoft Smooth Streaming, RTSP/RTP and the MPEG-2 Transport Stream. For testing the application we used four different video files, all related to the Gandhara culture, as the Center for Documentation and Conservation at RWTH Aachen presented an exhibition about this culture [32].

Figure 7. Example of media description created by SeViAnno

D. Reflections on the Implementation Phase Due to the large number of different services, we encountered many policy problems; for instance, Flash restricts client access to other servers. Furthermore, using the Flex SDK revealed a problem with the Google Maps API for Flex, which is not as sophisticated as the API for Ajax; for example, reverse geocoding is not supported.

V. EVALUATION
Evaluation was an essential part of the creation of the Web service. It allowed us to improve usability and to elicit further requirements. We conducted two main evaluation phases: the first used a paper prototype, the second a test of the SeViAnno software prototype.

The software prototype was evaluated in two ways. During the tests, all session times and method calls were recorded with MobSOS [33], enabling us to identify the most frequently used features. Besides analyzing this data, we conducted user surveys to evaluate the usability of the SeViAnno application.

Monitored by MobSOS, more than 1000 method calls were executed during our three-day testing phase. Most calls retrieved data such as keywords, semantic annotations and full-text descriptions. This high share is explained by the fact that all semantic tags are retrieved again after an annotation is added. Around 100 method calls created new information for the four video files: one third added keywords, and 73 calls added semantic tags, of which Place accounted for 31. This can be explained by the content of the videos, which show many maps and change location frequently; the very intuitive use of the map also plays a role, however. The prototype was tested by about ten persons, the majority of whom had a computer science background. Figure 8 depicts the distribution of MPEG-7 service method calls among the respective users.

Figure 8. MPEG-7 Service Usage Statistics for SeViAnno
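The aggregation behind such usage statistics can be sketched as a simple tally over a call log. The log format and the method names below are hypothetical illustrations, not the actual MobSOS data model or the LAS service API.

```python
from collections import Counter

# Hypothetical (user, method) entries of a MobSOS-style call log.
call_log = [
    ("u1", "addKeyword"), ("u1", "addSemanticTag:Place"),
    ("u2", "getSemanticTags"), ("u2", "addSemanticTag:Place"),
    ("u3", "getKeywords"), ("u3", "addKeyword"),
    ("u3", "getSemanticTags"),
]

# Tally calls per method (which features are used most) and per user
# (the distribution shown in Figure 8).
calls_per_method = Counter(m for _, m in call_log)
calls_per_user = Counter(u for u, _ in call_log)

# Share of Place annotations among all logged calls.
share_place = calls_per_method["addSemanticTag:Place"] / len(call_log)
```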

The user surveys were intended to compare the prototype with VideoANT and M-OntoMat-Annotizer. Each participant could evaluate each tool for as long as they wished. The first tool was VideoANT, then M-OntoMat-Annotizer, and finally the SeViAnno prototype itself. The evaluation showed that SeViAnno is regarded as only slightly less usable than VideoANT, but much more usable than M-OntoMat-Annotizer. With respect to metadata, SeViAnno is regarded as much more precise than VideoANT, but slightly less precise than M-OntoMat-Annotizer.

To conclude, conducting user tests with both a paper and a software prototype is highly recommended to improve usability. A paper prototype provides an early indication of design flaws and missing features; it can prevent a complete redesign of an already implemented user interface and save considerable time otherwise spent reorganizing or adding essential functions in later development phases. A software prototype gives a good overview of which functions are already mature and which still need refinement; it can also yield ideas for future development and help determine whether goals have been reached.

At the end of the software prototype evaluation we asked every user, among other questions, which tool provides the highest motivation to annotate videos. Seven persons answered SeViAnno, and two answered VideoANT. This shows how important a simple and usable tool is, especially in application fields where research teams want to collaborate with hobbyists who do not have all the expertise professionals might have.

VI. CONCLUSIONS AND OUTLOOK
Overall, the SeViAnno approach can be viewed as a guideline and proof of concept for the interactive semantization of multimedia using RIA, with evaluation results from two prototypes. We especially focused on accessibility for hobbyists and semi-professionals, who play a central role in the Cult of the Amateur concept. We first described the requirements for such applications; these are adjusted to the domain of cultural heritage management and seek a trade-off between intuitive but imprecise and precise but complex tools.

Our evaluation results show that this trade-off was achieved and that SeViAnno can be used for cultural heritage management.

Nevertheless, we found that intuitiveness has its limits: it is impossible to simplify the interface without losing a certain degree of precision. That is why we chose to reduce the semantic tag types to Agent, Event, Object, Concept, Place and Time.
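A client enforcing this reduced set might constrain annotations as sketched below. The enum and the validation helper are our illustration, not part of the SeViAnno API; only the six type names come from the paper.

```python
from enum import Enum

class SemanticTagType(Enum):
    """The six semantic tag types SeViAnno exposes (names per the
    paper); the mapping to concrete MPEG-7 semantic base description
    tools is left implicit here."""
    AGENT = "Agent"
    EVENT = "Event"
    OBJECT = "Object"
    CONCEPT = "Concept"
    PLACE = "Place"
    TIME = "Time"

def validate_tag(tag_type: str) -> SemanticTagType:
    # Raises ValueError for any tag type outside the reduced set.
    return SemanticTagType(tag_type)
```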

The evaluation phase gave us insight into possible future functionality, such as adding regions to annotations (e.g. defining the exact position of a person in the video) or a timeline for more intuitive time annotations. The paper prototype approach is useful to test ideas which cannot be implemented yet.

In the future, SeViAnno can be extended with much additional functionality. A central role should be taken by a search function that allows searching through all semantic tags already created. Another possible extension is the integration of Twitter and Facebook for increased community interaction and motivation of non-professional users.

Another added value would be the refinement of annotations. Instead of adding information only to the video file as a whole, annotations could be made separately for audio and video tracks. Furthermore, it is possible to make use of the MPEG-7 concept of semantic relations, as they are already definable in tools such as Caliph/Emir [34] for images. However, it is questionable whether interface simplicity would suffer if users had to specify these relations explicitly. Annotation automation could be helpful here, e.g. object recognition and tracking. Especially interesting for cultural heritage management is the automatic generation of building geometry; new devices known as Image Assisted Total Stations are currently being developed for this task [35].

The application will be rolled out to professional communities with different backgrounds (e.g. e-learning) in order to evaluate its applicability and usability. The Bamiyan Community and the Metadata Community will also be included.

The extension of SeViAnno should lead to an even more sophisticated, intuitive and useful Web service. A possible application domain is the multimedia documentation of monuments in Ghazni, Afghanistan, which will become the Islamic cultural capital in 2013 [36]. A mobile version of SeViAnno will allow direct on-site video capturing and annotation.

REFERENCES
[1] B.S. Manjunath, P. Salembier, and T. Sikora, Introduction to MPEG-7: Multimedia Content Description Interface, Wiley & Sons, 2002.
[2] A. Keen, The Cult of the Amateur: How Today's Internet is Killing Our Culture, Broadway Business, New York, 2007.
[3] G. Wersig, Thesaurus-Leitfaden: Eine Einführung in das Thesaurus-Prinzip in Theorie und Praxis, 2. erg. Aufl., Verlag Dokumentation Saur KG, München, 1985.
[4] E. Wenger, Communities of Practice: Learning, Meaning and Identity, Cambridge University Press, New York, 1998.
[5] M. Shacklett, "Gauging Web site performance", Credit Union Magazine, 67(2):60-62, 2001.
[6] G. Smith, Tagging: People-Powered Metadata for the Social Web, New Riders Publishing, Indianapolis, 2008.
[7] Y. Cao, M. Spaniol, R. Klamma, and D. Renzel, "Virtual Campfire - A Mobile Social Software for Cross-Media Communities", in K. Tochtermann, H. Maurer, F. Kappe, A. Scharl (Eds.): Proceedings of I-Media'07, International Conference on New Media Technology and Semantic Systems, Graz, Austria, September 5-7, 2007, J.UCS (Journal of Universal Computer Science) Proceedings, pp. 192-195.
[8] T. O'Reilly, "What Is Web 2.0", http://oreilly.com/web2/archive/what-is-web-20.html, Last access: May 27, 2010.
[9] T. Berners-Lee, J. Hendler, and O. Lassila, "The Semantic Web", Scientific American Magazine, May 2001.
[10] MPEG-7 Overview (version 10), ISO/IEC JTC1/SC29/WG11 N6828, Palma de Mallorca, Oct. 2004, http://mpeg.chiariglione.org/standards/mpeg-7/mpeg-7.htm, Last access: May 27, 2010.
[11] A. Hofmann, A. Kriechbaum, and W. Bailer, "Authoring Mobile Services Using MPEG-7 and MPEG-4 LASeR", in WMM'09, Proceedings of the 9th Workshop on Multimedia, Toulouse, France, 2009, pp. 80-90.
[12] Dublin Core Metadata Initiative (DCMI), http://www.dublincore.org, Last access: May 27, 2010.
[13] R. Klamma, M. Spaniol, M. Jarke, Y. Cao, M. Jansen, and G. Toubekis, "A Hypermedia Afghan Sites and Monuments Database", in Geographic Hypermedia - Concepts and Systems, LNGC, Springer-Verlag, 2006, pp. 189-209.
[14] J. Allaire, "Macromedia Flash MX - A next-generation rich internet client", Macromedia white paper, March 2002.
[15] J. Tapper, M. Labriola, M. Boles, and J. Talbot, Adobe Flex 3 - Das offizielle Trainingsbuch, Adobe Press, Addison-Wesley, München, 2008.
[16] J.J. Garrett, "Ajax: A New Approach to Web Applications", Adaptive Path LLC, San Francisco, February 18, 2005.
[17] Microsoft Silverlight, http://silverlight.net, Last access: May 27, 2010.
[18] H. Adams, "Mashup business scenarios and patterns", http://www.ibm.com/developerworks/lotus/library/mashups.patterns-pt1/, Last access: May 27, 2010.
[19] J.F. Jensen, "Interactivity: Tracking a New Concept in Media and Communication Studies", Nordicom Review, 19(1), 1998.
[20] Video Annotation Tool, Academic Technology Services, UMN, http://ant.umn.edu, Last access: May 27, 2010.
[21] Semantic Video Annotation, http://www.joanneum.at/fb2/iis/produkte-loesungen-services/semantic-video-annotation.html, Last access: May 27, 2010.
[22] IBM Research, "VideoAnnEx Annotation Tool", http://www.research.ibm.com/VideoAnnEx, Last access: May 27, 2010.
[23] S. Bloehdorn, K. Petridis, C. Saathoff, N. Simou, V. Tzouvaras, Y. Avrithis, S. Handschuh, Y. Kompatsiaris, S. Staab, and M.G. Strintzis, "Semantic Annotation of Images and Videos for Multimedia Analysis", in ESWC 2005, 2nd European Semantic Web Conference, Heraklion, Greece, May 2005.
[24] C. Snyder, Paper Prototyping: The Fast and Easy Way to Design and Refine User Interfaces (Interactive Technologies), Morgan Kaufmann, San Francisco, 2003.
[25] A. Dix, J. Finlay, G.D. Abowd, and R. Beale, Human Computer Interaction, Prentice Hall, New Jersey, 2003.
[26] B. Battleson, A. Booth, and J. Weintrop, "Usability testing of an academic library Web site: A case study", Journal of Academic Librarianship, 27(3):188-189, 2001.
[27] J. Nielsen, Designing Web Usability: The Practice of Simplicity, New Riders Publishing, Indianapolis, 2000.
[28] J. McLaughlin and D. Skinner, "Developing usability and utility: A comparative study of the users of new IT", Technology Analysis & Strategic Management, 12(3):413-423, 2000.
[29] D. Yamamoto, T. Masuda, S. Ohira, and K. Nagao, "Collaborative Video Scene Annotation Based on Tag Cloud", in Advances in Multimedia Information Processing, LNCS, Springer-Verlag, 2008, pp. 397-406.
[30] M. Spaniol, R. Klamma, H. Janßen, and D. Renzel, "LAS: A Lightweight Application Server for MPEG-7 Services in Community Engines", in K. Tochtermann, H. Maurer (Eds.): Proceedings of I-KNOW '06, 6th International Conference on Knowledge Management, Graz, Austria, September 6-8, 2006, J.UCS (Journal of Universal Computer Science) Proceedings, Springer, pp. 592-599.
[31] FFMPEG, http://www.ffmpeg.org, Last access: May 27, 2010.
[32] Kunst- und Ausstellungshalle der Bundesrepublik Deutschland, Bonn, http://www.kah-bonn.de/index.htm?ausstellungen/gandhara, Last access: May 27, 2010.
[33] D. Renzel, R. Klamma, and M. Spaniol, "MobSOS - A Testbed for Mobile Multimedia Community Services", in 9th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS '08), May 7-9, 2008, Klagenfurt, Austria.
[34] M. Lux, "Caliph & Emir: MPEG-7 photo annotation and retrieval", in Proceedings of the 17th ACM International Conference on Multimedia, Beijing, China, 2009, pp. 925-926.
[35] B.H. Walser, Development and calibration of an image assisted total station, IGP Mitteilungen Nr. 87, Zürich, 2005, doi:10.3929/ethz-a-004845672.
[36] Y. Cao, "eWay: Ghazni will be the Islamic Cultural City 2013", http://yiweicao.blogspot.com/2009/06/ghazni-will-be-islamic-cultural-city.html, Last access: May 27, 2010.