
TENTH INTERNATIONAL CONFERENCE ON

LANGUAGE RESOURCES AND EVALUATION

Held under the Honorary Patronage of His Excellency Mr. Borut Pahor, President of the Republic of Slovenia

MAY 23 – 28, 2016

GRAND HOTEL BERNARDIN CONFERENCE CENTRE

Portorož, SLOVENIA

CONFERENCE ABSTRACTS

Editors: Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis.

Assistant Editors: Sara Goggi, Hélène Mazo

The LREC 2016 Proceedings are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License


LREC 2016, TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION

Title: LREC 2016 Conference Abstracts

Distributed by:

ELRA – European Language Resources Association
9, rue des Cordelières
75013 Paris
France
Tel.: +33 1 43 13 33 33
Fax: +33 1 43 13 33 30
www.elra.info and www.elda.org
Email: [email protected] and [email protected]

ISBN 978-2-9517408-9-1 EAN 9782951740891


Introduction of the Conference Chair and ELRA President

Nicoletta Calzolari

Welcome to the 10th edition of LREC in Portorož, back on the Mediterranean Sea!

I wish to express to His Excellency Mr. Borut Pahor, the President of the Republic of Slovenia, the gratitude of the Program Committee and of all LREC participants, as well as my personal gratitude, for his Distinguished Patronage of LREC 2016.

Some figures: previous records broken again!

It is only the 10th LREC (18 years after the first), but it has already become one of the most successful and popular conferences in the field. We continue the tradition of breaking previous records. We received 1250 submissions, 23 more than in 2014, together with 43 workshop and 6 tutorial proposals.

At every LREC the Program Committee faces a harder and harder job: going through 3750 reviews to assess, beyond the scores and especially when they greatly differ, the relevance and novelty of each submission as well as its suitability for an oral or poster presentation. We have 744 papers in the program: 203 Orals and 541 Posters.

We recruited an impressive number of reviewers, 1046 (76 more than in 2014), to keep the number of papers per reviewer rather low. This was a great effort in which a very large part of our community was involved. To reach this number we had to invite 1427 colleagues, of whom 182 declined and 199 regrettably did not answer. In the end I must admit that a small number of reviewers did not fulfil their duty and we had to recruit others in a hurry.

We also have 30 Workshops and 6 Tutorials.

More than 1100 participants had already registered by the beginning of May.

These figures and the continuously growing trend have a clear meaning. The field of Language Resources and Evaluation is very alive and constantly flourishing. And LREC seems still to be – as many say – “the conference where you have to be and where you meet everyone”.

LREC acceptance rate: a reasoned choice

Also this time, as I usually do, I want to highlight the LREC acceptance rate: 59.52% this year (744 accepted papers out of 1250 submissions), unusual among major conferences but for us a reasoned choice. This level of acceptance is a special feature of LREC and is probably one of the reasons why LREC succeeds in giving us an overall picture of the field and in revealing how it is evolving. For us it is in fact important not only to look at the top methods but also to see how widely various methods or resources spread, for which purposes and usages, and across which languages. Multilingualism – and the equal treatment of all languages – is an essential feature of LREC, as is the effort to bring the text, speech and multimodal communities together.

The acceptance rate goes together with the sense of inclusiveness that is important for us (instead of the sense of elitism associated with a low acceptance rate).


And I want to underline again that quality is not necessarily undermined by a high acceptance rate; quality is also determined by the influence of the papers on the community, and the ranking of LREC among other conferences in the same area proves this. According to the Google Scholar h-index, LREC ranks 4th among the top publications in Computational Linguistics.

I was really proud when a colleague recently told me that LREC, with its broad variety of topics, is the conference where he gets more ideas than at any other!

LREC 2016 Trends

From one LREC to another I have tried (since 2004) to spot, even if in a cursory and subjective way, the major trends and the rise and fall of certain topics.

After highlighting the major trends of 2016, also in comparison to 2014 and previous years, I offer here a few general considerations. The comparison with previous years highlights the topics where we find steady progress, or even great leaps forward, those that remain stable, and those that may be more affected by the fashion of the moment.

Trends in LREC 2016 topics, also compared to 2014

Among the areas that continue to be trendy and are increasing I can mention:
▪ Social Media analysis, started in 2012 and increasing in 2014, is doubling again
▪ Discourse, Dialogue and Interactivity
▪ Treebanks, with a big increase with respect to the past
▪ Less-resourced languages
▪ Semantics in general and in particular Sentiment, Emotion and in general Subjectivity
▪ Information extraction, Knowledge discovery, Text mining
▪ Multilinguality in general and Machine Translation
▪ Evaluation methodologies

Unsurprisingly many papers in the “usual” topics:
▪ Lexicons, even if a bit decreasing
▪ Corpora
▪ Infrastructural issues, policies, strategies and Large projects: topics that receive special attention at LREC, differently from other major conferences. Another distinguishing feature for LREC.

Newer trends:
▪ Digital Humanities
▪ Robotics

Stable topics:
▪ Speech related topics, a little increasing but not as much as we would like
▪ Multimodality
▪ Grammar and syntax
▪ Linked data, a new topic in 2014, remains stable
▪ Computer Aided Language Learning, an increasing topic in 2014, is stable


Less-represented topics with respect to the past:
▪ Web services and workflows
▪ Sign language (probably because there is a very successful workshop on this)
▪ Ontologies
▪ Standards and metadata
▪ Temporal and spatial annotation
▪ Crowdsourcing

Overall trends, from 2004 … and before

From 2004 we observe a big increase in papers related to Multilingualism and Machine Translation. This may also be related to the funding of Machine Translation projects from the European Commission.

The analysis of social media and of subjectivity (sentiment, opinion, emotion) started in this time span and is not only well consolidated but also continually expanding.

There is a declining tendency for papers related to grammar and syntax. This, however, makes the sharp increase of papers on Treebanks this year even more interesting.

There seems to be a small decrease of papers on Lexicons and Lexical acquisition, as well as on Terminology. These were probably more popular topics years ago, when WordNet and FrameNet lexicons were being built for many languages.

I have recently been reminded by a colleague at ILC of a paper I wrote many years ago with some considerations on Computational Linguistics as reflected by the papers at COLING 1982. There was obviously no mention of Language Resources (the term itself was coined by Zampolli later on), but already then I underlined the novelty of the area that was called at the time “Linguistic Data Bases”, with some papers on dictionaries “in machine-readable form”. Yet I never mentioned the word “Corpus” in my review, and probably neither did the papers! Only 30 years ago Computational Linguistics was a totally different field. The new area of Language Resources was born some years after those initial pioneering sparse papers. But this new topic, as testified by the success of LREC, has expanded incredibly fast.

And a new community has taken shape around Language Resources. A peculiarity of this community is the attention paid to infrastructural issues and to overall strategies and policies. This is also due, I believe, to the fact that in many cases we have to work in large groups and for many languages: we must be able to build on each other's work, to connect different resources and tools, to make available what already exists, and to use standardised formats. Infrastructures (along many dimensions) are really needed for this field to progress.

I wrote in the introduction to LREC 2006, 10 years ago: “Do we have revolutions? Probably not. Even if the stable growth of the field brings in itself some sort of revolution. After a proliferation of LRs and tools, we need now to converge. We need more processing power, more integration of modalities, more standards and interoperability, more sharing (in addition to distribution), more cooperative work (and tools enabling this), which means also more infrastructures and more coordination.” I think that many of the needs I expressed then are being met today, as testified by the papers in this edition of LREC, and therefore we can probably speak of a sort of quiet revolution.


LREC Proceedings in Thomson Citation Index

Let me also remind you that since 2010 the LREC Proceedings have been accepted for inclusion in the Thomson Reuters Conference Proceedings Citation Index (CPCI). This is quite an important achievement, providing better recognition to all LREC authors and particularly useful for young colleagues.

ELRA and LREC

ELRA 20th Anniversary: achievements, promotion of new trends and community building

In 2015 we organised a workshop for the 20th anniversary of ELRA, founded in 1995. I consider it a great success that ELRA has remained in the Language Technology picture, with growing influence, throughout these 20 years, even more so given that ELRA does not rely on specific public funding.

ELRA has implemented, over the years, many services for the language resource community, promoting the integration not only of different modalities but also of different communities. And ELRA itself has evolved to reflect the evolution of the field and sometimes to anticipate new trends.

I mention here a few services: Evaluation, Validation, Production of LRs, Networking, LREC, the LRE Map, Sharing of LRs, the Licensing Wizard, ISLRN, the LRE Journal, the Less-resourced Languages Committee. These initiatives must not be seen as unrelated steps, but as part of a coherent vision promoting a new culture in our community. Many of ELRA's actions are infrastructural in nature, in the knowledge that research is also affected by such infrastructural activities.

LREC is probably the most influential ELRA achievement, the one with the largest impact on the overall community. Also through LREC, ELRA has certainly contributed to shaping our field, making Language Resources a scientific field in its own right.

This is also testified by the LRE journal, co-edited by Nancy Ide and myself, whose submissions are continuously increasing.

Citation of Language Resources

This year we introduced and encouraged citations of Language Resources, providing recommendations on how to cite: we have in fact added a special References section. This must become normal practice, to keep track of the relevance of LRs but also to provide due recognition to those who work on language resources.

This will be implemented also in the LRE journal.
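
As an illustration only, and not the official LREC or LRE Journal citation template, the sketch below shows how a reference entry for a language resource might be assembled; every field value in it is invented.

```python
# Illustrative only: not the official LREC or LRE Journal citation format.
# All field values below are invented placeholders.

def format_lr_reference(creator, year, title, distributor, islrn=None):
    """Build a simple reference string for a language resource."""
    ref = f"{creator} ({year}). {title}. Distributed by {distributor}."
    if islrn:
        ref += f" ISLRN: {islrn}."
    return ref

if __name__ == "__main__":
    print(format_lr_reference(
        creator="Example Consortium",
        year=2016,
        title="Example Multilingual Chat Corpus",
        distributor="ELRA",
        islrn="123-456-789-012-3",  # made-up identifier, for illustration
    ))
```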

Replicability of research results

We continue to promote, also through many of the initiatives above, greater visibility of LRs, easier sharing of LRs, and the replicability of research results as a normal part of scientific practice. ELRA is thus strengthening the LR scientific ecosystem and fostering sustainability.


Acknowledgments

As usual, it is my pleasure to express here my deepest gratitude to all those who made this LREC 2016 possible and hopefully successful.

I first thank the Program Committee members, not only for their dedication in the huge task of selecting the papers, the workshops and tutorials, but also for the constant involvement in the various aspects around LREC. A particular thanks goes to Jan Odijk, who has been so helpful in the preparation of the programme. To Joseph Mariani for his always wise suggestions. And obviously to Khalid Choukri, who is in charge of so many aspects around LREC.

I thank ELRA and the ELRA Board: LREC is a major service from ELRA to all the community!

A very special thanks goes to Sara Goggi and Hélène Mazo, the two Chairs of the Organizing Committee, for all the work they do with so much dedication and competence, and for their capacity to tackle the many big and small problems of such a large conference (not an easy task). They are the two pillars of LREC, without whose commitment over many months LREC would not happen. So much of the LREC organisation is on their shoulders, and this is visible to all participants.

I am grateful also to the Local Committee, especially to Simon Krek (its Chair) and Marko Grobelnik, for their help in organising a successful LREC.

My appreciation goes also to the distinguished members of the Local Advisory Committee for their constant support.

I express my great gratitude to the Sponsors that believe in the importance of our conference, and have helped with financial support. I am grateful to the authorities, and the associations and organisations that have supported LREC in various ways.

Furthermore, on behalf of the Program Committee, I praise our impressively large Scientific Committee. They did a wonderful job.

I thank the workshop and tutorial organisers, who complement LREC with so many interesting events.

A big thanks goes to all the LREC authors, who provide the “substance” to LREC, and give us such a broad picture of the field.

I finally thank the two institutions that always dedicate a great effort to LREC, i.e. ELDA in Paris and ILC-CNR in Pisa. Without their commitment LREC would not be possible. The last, but not least, thanks are thus, in addition to Hélène Mazo and Sara Goggi, to all the others who – with different roles – have helped and will help during the conference: Paola Baroni, Roberto Bartolini, Dominique Brunato, Irene De Felice, Riccardo Del Gratta, Meritxell Fernández Barrera, Francesca Frontini, Lin Liu, Valérie Mapelli, Monica Monachini, Vincenzo Parrinelli, Vladimir Popescu, Valeria Quochi, Caroline Rannaud, Irene Russo, Priscille Schneller, Alexandre Sicard. You will meet most of them during the conference.

I also hope that funding agencies will be impressed by the quality and quantity of initiatives in our sector that LREC displays, and by the fact that the field attracts all the best groups of R&D from all continents. The success of LREC for us actually means the success of the field of Language Resources and Evaluation.


And lastly, my final words of appreciation are for all the LREC 2016 participants. Now LREC is in your hands. You are the true protagonists of LREC; we have worked for you all and you will make this LREC great. I hope that you discover new paths, that you perceive the ferment and liveliness of the field, that you have fruitful conversations (conferences are useful also for this) and, most of all, that you profit from so many contacts to organise new exciting work and projects in the field of Language Resources and Evaluation … which you will show at the next LREC.

LREC is again in a Mediterranean location this time! I am sure you will like Portorož and Slovenia and the Mediterranean atmosphere. And I hope that Portorož and Piran will appreciate the invasion of LRECers!

With all the Programme Committee, I welcome you to LREC 2016 and wish you a very fruitful Conference.

Enjoy LREC 2016 in Portorož!

Nicoletta Calzolari

Chair of the 10th International Conference on Language Resources & Evaluation and ELRA President


Message from ELRA Secretary General and ELDA Managing Director

Khalid Choukri

Welcome to LREC 2016, the 10th edition of one of the major events in language sciences and technologies and the most visible service of the European Language Resources Association (ELRA) to the community. Let me first express, on behalf of the ELRA/ELDA and LREC team, our profound gratitude to His Excellency Mr. Borut Pahor, President of the Republic of Slovenia, for his Distinguished Patronage of LREC 2016 and for honoring us with his presence.

ELRA & LREC

Since 1998 and the first LREC in Granada, it has been a privilege to speak to the language sciences and technologies communities every two years. We always feel gifted to have the chance to share ELRA's views, concerns, expectations, and plans for the future with over a thousand experts gathered in a lovely and relaxed atmosphere. It is also an important occasion to report on our observations of the community's activities, in particular on the recent trends that we continuously monitor. It is very rewarding to do so in a place to be remembered. The ELRA Board has always done its best to make this event memorable: an event with rich scientific content, offering great networking opportunities, occasions to initiate new projects, to form new friendships with colleagues in the same field, and a lot of emotions.

Let me first share with you a few thoughts about our organization of LREC before making some remarks about our recent activities and those of our community. When discussing LREC within the ELRA Board, the first major step relates to identifying the next location. The usual and typical requirements concerning logistics are the first and most critical criteria debated. The motivation and support of a local community is also a decisive factor. Remember that there is no call for bids but rather spontaneous proposals from local teams. Since its inception, LREC has been organized by the same permanent team, with the support of local committees representing our field. Last but not least, the attractiveness of the location is a crucial element, now part of the conference’s DNA. Our motto has always been that working and meeting in very relaxing conditions can improve our efficiency and productivity and enhance our interactions. We have never jeopardized the scientific part when selecting our locations.

ELRA monitoring the new HLT trends

Over the last nine LRECs, and now this tenth one, we have had the privilege to witness the emergence of new research fields but also the deployment of impressive applications that we, as a community (and also as ELRA), have proudly laid the ground for. ELRA was established in 1995 and the first LREC took place in Granada in 1998. In November 2015 we celebrated our 20th anniversary. It was very exciting to discuss how far we have really come in fulfilling some of the objectives that were essential to the development of our field: many of us remember how challenging it was to develop automatic speech recognition for very basic tasks (even to recognize the 10 digits and some command words, or to discriminate between all forms of Yes and No, in a few languages, the usual "big" ones), not to mention dialogue systems and speech understanding. How challenging and critical it was to make tiny steps of progress in Machine Translation systems, mostly for a few "lucrative" languages, and to develop methodologies to assess their performance. Extending these tasks with more challenging achievements is today one of the core activities of many large research teams all over the world, and one can imagine the underlying techniques composing all these applications, from processing capacities to basic tools (image and acoustic processing, morphological, syntactic and semantic analysis) and the curation of appropriate Language Resources. All of this is now affordable for many players, using open source packages, cloud computing facilities, etc. Breakthroughs are impressive and many applications are now deployed and used by the public at large. However, the bottleneck remains, as in the past, the availability of openly licensable Language Resources and their sharing within the community, in particular for the less-resourced languages.

Our community is probably one of the "Big Data" communities that benefited largely from the emergence of the web. The web came as a repository of treasures of data that pushed our "data-driven" paradigms. The web also brought new challenges and potential problems related to language processing, understanding, summarization, generation, translation, etc. The web boosted new needs for tackling data and text mining, sentiment analysis, opinion detection, etc. Twitter and other social media, for instance, have now become one of the main data sources while generating scientific problems that the research communities have to address. These trends are now common ground for many research topics and activities, but they also highlight the serious gaps between languages and the many barriers that are hindering progress in various fields and geographical areas.

ELRA and the legal issues

In addition to the problems related to discovering and interoperating data sets, a critical obstacle that has limited our investigations has to do with legal and ethical aspects. These are important issues, and ELRA has always worked on raising the awareness of the various stakeholders (research and industrial communities as well as policy makers). I have often stressed this in my speeches at LRECs: the legal and ethical issues are major topics that require more lobbying and petitioning from our community. Research progress, in particular when it is not-for-profit and carried out in academic spheres, should not be impeded by so many obstacles, most of which should have disappeared with the emergence of the internet and the digital world. The most critical issue is copyright, along with related laws and regulations, which prevents the re-use of data for research. The idea is not to restrict the rights of authors and creators or other intellectual property holders, but, on the contrary, to ensure a legal fair use of copyrighted works by the research community and to prevent misuses that are very common, as illustrated by the statement "the web as a corpus", which seems to imply that online content is by default freely reusable. A few countries have adopted the doctrine of fair use, giving their research communities a very "competitive" advantage. This is the case in the USA (where Federal Government works are exempt from copyright protection).

An important move from the European Commission (EC) is its challenging objective to establish a Single Digital Market (SDM) across the European Union Member States. The language barriers are only one of the obstacles that the European Commission will face in establishing the SDM. To boost innovation based on the data held by public sector bodies, the EC has issued an important regulation, the Public Sector Information (PSI)1 Directive. It requires all public bodies to release the data they produce to the public so that the data can be used for innovative applications. Linguistic data is part of the deal and we hope to see more resources with cleared IPR in our catalogues and in our repositories soon, for use by all. The USA and the European Union are simple examples of a large international movement that will hopefully be beneficial to our community. However, as legal aspects are always subject to interpretation, the issue of "Territoriality and Extraterritoriality in Intellectual Property Law"2 does not help to clear these matters. It is crucial that an international harmonization addresses this (maybe a serious amendment of existing conventions). Our community should contribute to the initiatives going on in various countries concerning copyright and other related rights in the information society3, to boost data sharing and re-use. ELRA strongly supports the Open Data movement and advocates making public sector information more accessible and re-usable, without any license or through very permissive and open licenses.

Another topic debated at the 20th anniversary of ELRA was the ethical issues that have to be considered in our field, either when dealing with data management (for instance the crowdsourcing approach to data production) or when replicating experiments and citing publications of other colleagues. I am very happy to see that the dedicated workshop on "Legal Issues" (http://www.elra.info/en/dissemination/elra-events/legal-issues-workshop-lrec2016/) is taking place at LREC once more, with a large number of registered participants. Another workshop, on ethical issues, is scheduled within this LREC (ETHics In Corpus Collection, Annotation and Application, http://emotion-research.net/sigs/ethics-sig/ethi-ca2).

1 https://ec.europa.eu/digital-single-market/en/european-legislation-reuse-public-sector-information
2 Alexander Peukert, "Territoriality and extra-territoriality in intellectual property law", in Günther Handl, Joachim Zekoll & Peer Zumbansen (eds.), Beyond Territoriality: Transnational Legal Authority in an Age of Globalization, Queen Mary Studies in International Law, Brill Academic Publishing, Leiden/Boston, 2012, 189-228.
3 Study on the application of Directive 2001/29/EC on copyright and related rights in the information society (the "Infosoc Directive"), Jean-Paul Triaille, Séverine Dusollier, Sari Depreeuw, Jean-Paul Triaille (ed.), ISBN: 978-92-79-29918-6, DOI: 10.2780/90141, © European Union, 2013.

Replicating experiments and Data Citation

This crucial topic of replicating experiments covers a large spectrum of behaviors and was reviewed by the expert group gathered to discuss the activities of our field at the twentieth anniversary of ELRA, under the heading "Reproducibility of the Research Results and Resources Citation in Science and Technology of Language". It has emerged as a hot topic for discussion within many fields of research. With António Branco and Nicoletta Calzolari, we made a proposal for a discussion workshop to debate this topic at LREC, and I am very grateful to António Branco, who agreed to take the lead on this sensitive topic (4REAL Workshop: http://4real.di.fc.ul.pt/). As in many scientific fields, several dimensions are important and require specific consideration. Of course, maintaining research integrity is essential and requires that replication of published results is possible, and even guaranteed. Such replication can only be done through the sharing of resources and approaches. The same requirements apply when comparing results across approaches, which, in addition, requires that one clearly identifies the resources used in the benchmarking. The identification of resources and their citation are thus correlated, and more attention is required than we have given so far. The ISLRN (International Standard Language Resource Number, http://www.islrn.org/) has been introduced and is being supported by the major data centers. We hope that monitoring this process will show how impactful it is on our field. Citation mechanisms applied to Language Resources affect the so-called "impact factor" and the underlying research indexes, and have to be taken care of by the community.
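
A minimal sketch of the identifier side of this, assuming the ISLRN is written as thirteen digits grouped as XXX-XXX-XXX-XXX-X (as on islrn.org); it only checks the surface pattern and neither queries the ISLRN registry nor verifies any checksum.

```python
import re

# Assumed surface pattern for an ISLRN: four groups of three digits plus a
# final digit, e.g. "123-456-789-012-3". This does not consult islrn.org.
ISLRN_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{3}-\d{3}-\d$")

def looks_like_islrn(identifier: str) -> bool:
    """Return True if the string matches the assumed ISLRN layout."""
    return bool(ISLRN_PATTERN.match(identifier.strip()))

if __name__ == "__main__":
    for candidate in ["123-456-789-012-3", "not-an-islrn"]:
        print(candidate, looks_like_islrn(candidate))
```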

ELRA & Data Management Plan

Another major topic that was briefly discussed at the ELRA 20th anniversary is Data Management Plans and the underlying Language Resource sustainability factors (DMP&S). Within its activities and through a number of projects, ELRA has always advocated a comprehensive data management strategy that ensures efficient management of the production, repurposing or repackaging processes, a clear and adequate validation process, sharing and distribution mechanisms, and a sustainability plan4 (http://www.flarenet.eu/sites/default/files/D2.2a.pdf). It is now common practice for most funding agencies to request that a Data Management Plan be part of the proposals they receive. We hope that such plans are seriously designed and not treated merely as administrative sections of the proposals. ELRA will continue to work on this challenging task for the emerging resource types and contexts of use and will continue to offer its support (including through its helpdesk on technical and legal matters) to proposers and project managers. ELRA is working on a specific tool, a wizard to help Language Resource managers produce their own DMP, based on its own background and on input from other projects (the MIT Libraries’ Research Data Management Plan5 and the initiative held by the Inter-university Consortium for Political and Social Research6).

To assess how this "sustainability" dimension is taken into consideration, ELRA has established a set of resources that have been monitored by our Language Resource experts since 2010. We started the list with resources that were not part of the catalogues of data centers. One would be surprised to see that over 28% have simply disappeared from the radar, with many others requiring a thorough search before one can find them again. And it is always complicated to identify them accurately and verify that they correspond to the ones on our list. Again, this emphasizes the need for repositories that adhere to a clear code of conduct. The simple "web" (and URLs) will not constitute a reliable and persistent repository.
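
To make the kind of monitoring described above concrete, here is a minimal sketch, using only the Python standard library, that checks whether a list of resource URLs still answers; the URLs are placeholders and ELRA's actual monitoring procedure is certainly more thorough.

```python
import urllib.request

# Placeholder list: in practice this would be the monitored resource catalogue.
RESOURCE_URLS = [
    "http://www.islrn.org/",
    "http://example.org/some-vanished-language-resource/",
]

def is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # network errors, HTTP errors (>= 400) and malformed URLs all count as unreachable
        return False

if __name__ == "__main__":
    missing = [u for u in RESOURCE_URLS if not is_reachable(u)]
    print(f"{len(missing)} of {len(RESOURCE_URLS)} resources unreachable")
    for url in missing:
        print("  gone or moved:", url)
```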

LREC 2016, some features

As usual, LREC 2016 features a large number of workshops. We are proud to continue to support the Sign Language workshop, which is building bridges between several research communities and setting up partnerships for research on so many modalities (http://www.sign-lang.uni-hamburg.de/lrec2016/workshops.html). We are happy to see that the community is very active in paying attention to the less-resourced and under-resourced languages (http://www.ilc.cnr.it/ccurl2016/index.htm), a topic that is now taken care of by a dedicated ELRA committee, the LRL committee. Many other specific topics are covered in this edition’s workshops: Arabic and Indian languages, social media, emotions, affect analysis, multimodal resources, and several workshops dealing with MT and MT-related aspects.

LREC 2016 also features a panel discussion with some of the major funding agencies. We hope to draw some conclusions about their past activities but also to discuss roadmaps for the next decade. Last but not least, the importance of our technologies was emphasized when a strong earthquake hit Nepal in April 2015. Many teams offered their technologies to help the rescue groups, in particular multilingual applications. Some teams designed and quickly developed new applications for this. ELRA donated Nepalese resources for this purpose. We are thankful to our partners who agreed to waive all fees on these resources and very grateful to those who developed the applications that may have helped a little bit.

4 Khalid Choukri and Victoria Arranz, "An Analytical Model of Language Resource Sustainability", Proceedings of LREC 2012.
5 https://libraries.mit.edu/data-management/plan/write/
6 http://www.icpsr.umich.edu/

Acknowledgments:

Finally, I would like to express my deep thanks to our partners and supporters, who throughout the years have made LREC so successful. I would like to thank our Bronze sponsors, EML (European Media Laboratory GmbH) and Intel, our supporter Viseo, and our media sponsor MultiLingual Computing, Inc.

I would also like to thank the HLT Village participants; we hope that this gathering offers the projects an opportunity to foster their dissemination and, hopefully, to discuss exploitation plans with the participants.

I would like to thank the Local Advisory Committee. Its composition, drawing on the most distinguished personalities of Slovenia, shows the importance of language and language technologies for the country. We do hope that it is a strong sign of the long-term commitment of Slovenian officials.

I would like to thank the LREC Local Committee, chaired by Dr Simon Krek, and the LREC Local Organizing Committee, chaired by Marko Grobelnik, in particular Špela Sitar and Monika Kropej, for providing support to the organization of this LREC edition in Slovenia.

Finally, I would like to warmly thank the joint team of the two institutions that devoted so much effort over months, and often behind the scenes, to make this week memorable: ILC-CNR in Pisa and my own team, ELDA, in Paris. These are the two LREC coordinators and pillars, Sara Goggi and Hélène Mazo, and the team: Roberto Bartolini, Irene De Felice, Meritxell Fernández-Barrera, Riccardo Del Gratta, Francesca Frontini, Lin Liu, Valérie Mapelli, Monica Monachini, Vincenzo Parrinelli, Vladimir Popescu, Caroline Rannaud, Irene Russo, Priscille Schneller and Alexandre Sicard.

Now LREC 2016 is yours; we hope that each of you will achieve valuable results and accomplishments. We, the ELRA and ILC-CNR staff, are at your disposal to help you get the best out of it. Once again, welcome to Portorož and Slovenia, and welcome to LREC 2016!


Table of Contents

O1 - Machine Translation and Evaluation (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

O2 - Sentiment Analysis and Emotion Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

O3 - Corpora for Language Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

O4 - Spoken Corpus Dialogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

P01 - Anaphora and Coreference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

P02 - Computer Aided Language Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

P03 - Evaluation Methodologies (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

P04 - Information Extraction and Retrieval (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

O5 - LR Infrastructures and Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

O6 - Multimodality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

O7 - Multiword Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

O8 - Named Entity Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

P05 - Machine Translation (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

P06 - Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

P07 - Speech Corpora and Databases (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

P08 - Summarisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

P09 - Word Sense Disambiguation (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

O9 - Linked Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

O10 - Multilingual Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

O11 - Lexicons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

O12 - OCR for Historical Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

P10 - Discourse (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

P11 - Morphology (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

P12 - Sentiment Analysis and Opinion Mining (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

P13 - Semantics (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

O13 - Large Projects and Infrastructures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

O14 - Document Classification and Text Categorisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

O15 - Morphology (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

O16 - Phonetics and Prosody . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

P14 - Lexical Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

P15 - Multimodality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48


P16 - Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

P17 - Part of Speech Tagging (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

P18 - Treebanks (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

O17 - Language Resource Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

O18 - Tweet Corpora and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

O19 - Dependency Treebanks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

O20 - Word Sense Disambiguation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

P19 - Discourse (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

P20 - Document Classification and Text Categorisation (1) . . . . . . . . . . . . . . . . . . . . . . . . . . 62

P21 - Evaluation Methodologies (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

P22 - Information Extraction and Retrieval (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

P23 - Prosody and Phonology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

P24 - Speech Processing (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

O21 - Social Media . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

O22 - Anaphora and Coreference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

O23 - Machine Learning and Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

O24 - Speech Corpus for Health . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

P25 - Crowdsourcing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

P26 - Emotion Recognition/Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

P27 - Machine Translation (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

P28 - Multiword Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

P29 - Treebanks (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

P30 - Linked Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

P31 - LR Infrastructures and Architectures (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

P32 - Large Projects and Infrastructures (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

P33 - Morphology (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

P34 - Semantic Lexicons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

O25 - Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

O26 - Discourse and Dialogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

O27 - Machine Translation and Evaluation (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

O28 - Corpus Querying and Crawling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

P35 - Grammar and Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

P36 - Sentiment Analysis and Opinion Mining (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

P37 - Parallel and Comparable Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

P38 - Social Media . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

P39 - Word Sense Disambiguation (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

O29 - Panel on International Initiatives from Public Agencies . . . . . . . . . . . . . . . . . . . . . . . . 106

O30 - Multimodality, Multimedia and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

O31 - Summarisation and Simplification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

O32 - Morphology (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108


P40 - Dialogue (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

P41 - Language Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

P42 - Less-Resourced Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

P43 - Named Entity Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

O33 - Textual Entailment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

O34 - Document Classification, Text categorisation and Topic Detection . . . . . . . . . . . . . . . . . . . 118

O35 - Detecting Information in Medical Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

O36 - Speech Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

O37 - Robots and Conversational Agents Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

O38 - Crowdsourcing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

O39 - Corpora for Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

O40 - Treebanks and Syntactic and Semantic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

P44 - Corpus Creation and Querying (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

P45 - Evaluation Methodologies (3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

P46 - Information Extraction and Retrieval (3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

P47 - Semantic Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

P48 - Speech Processing (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

O41 - Discourse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

O42 - Twitter Related Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

O43 - Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

O44 - Speech Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

P49 - Corpus Creation and Querying (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

P50 - Document Classification and Text Categorisation (2) . . . . . . . . . . . . . . . . . . . . . . . . . . 142

P51 - Multilingual Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

P52 - Part of Speech Tagging (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

O45 - Lexicons: Wordnet and Framenet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

O46 - Digital Humanities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

O47 - Text Mining and Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

O48 - Corpus Creation and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

P53 - Dialogue (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

P54 - LR Infrastructures and Architectures (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

P55 - Large Projects and Infrastructures (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

P56 - Semantics (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

P57 - Speech Corpora and Databases (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

Authors Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164


O1 - Machine Translation and Evaluation (1)
Wednesday, May 25, 11:35

Chairperson: Bente Maegaard Oral Session

Evaluating Machine Translation in a Usage Scenario

Rosa Gaudio, Aljoscha Burchardt and António Branco

In this document we report on a user-scenario-based evaluation aiming at assessing the performance of machine translation (MT) systems in a real context of use. We describe a series of experiments that has been performed to estimate the usefulness of MT and to test if improvements of MT technology lead to better performance in the usage scenario. One goal is to find the best methodology for evaluating the eventual benefit of a machine translation system in an application. The evaluation is based on the QTLeap corpus, a novel multilingual language resource that was collected through a real-life support service via chat. It is composed of naturally occurring utterances produced by users while interacting with a human technician providing answers. The corpus is available in eight different languages: Basque, Bulgarian, Czech, Dutch, English, German, Portuguese and Spanish.

Using BabelNet to Improve OOV Coverage in SMT

Jinhua Du, Andy Way and Andrzej Zydron

Out-of-vocabulary words (OOVs) are a ubiquitous and difficult problem in statistical machine translation (SMT). This paper studies different strategies of using BabelNet to alleviate the negative impact brought about by OOVs. BabelNet is a multilingual encyclopedic dictionary and a semantic network, which not only includes lexicographic and encyclopedic terms, but connects concepts and named entities in a very large network of semantic relations. By taking advantage of the knowledge in BabelNet, three different methods – using direct training data, domain-adaptation techniques and the BabelNet API – are proposed in this paper to obtain translations for OOVs to improve system performance. Experimental results on English-Polish and English-Chinese language pairs show that domain adaptation can better utilize BabelNet knowledge and performs better than other methods. The results also demonstrate that BabelNet is a really useful tool for improving translation performance of SMT systems.
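
For readers less familiar with the notion of OOVs, the sketch below shows only the basic bookkeeping step of finding tokens absent from a training vocabulary; it is not the authors' BabelNet-based method, and the toy data are invented.

```python
# Toy illustration of identifying out-of-vocabulary (OOV) tokens, i.e. words
# in the input that an SMT system never saw in its training data. This is
# only the detection step, not the BabelNet-based translation of OOVs.

def find_oovs(sentence: str, training_vocabulary: set) -> list:
    """Return the tokens of the sentence that are not in the vocabulary."""
    return [tok for tok in sentence.lower().split()
            if tok not in training_vocabulary]

if __name__ == "__main__":
    vocab = {"the", "cat", "sat", "on", "mat"}              # toy training vocabulary
    print(find_oovs("The cat sat on the ottoman", vocab))   # -> ['ottoman']
```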

Enhancing Access to Online Education: Quality Machine Translation of MOOC Content

Valia Kordoni, Antal van den Bosch, Katia Lida Kermanidis, Vilelmini Sosoni, Kostadin Cholakov, Iris Hendrickx, Matthias Huck and Andy Way

The present work is an overview of the TraMOOC (Translation for Massive Open Online Courses) research and innovation project, a machine translation approach for online educational content. More specifically, video lectures, assignments, and MOOC forum text are automatically translated from English into eleven European and BRIC languages. Unlike previous approaches to machine translation, the output quality in TraMOOC relies on a multimodal evaluation schema that involves crowdsourcing, error type markup, an error taxonomy for translation model comparison, and implicit evaluation via text mining, i.e. entity recognition and its performance comparison between the source and the translated text, and sentiment analysis on the students’ forum posts. Finally, the evaluation output will result in more and better quality in-domain parallel data that will be fed back to the translation engine for higher quality output. The translation service will be incorporated into the Iversity MOOC platform and into the VideoLectures.net digital library portal.

The Trials and Tribulations of Predicting Post-Editing Productivity

Lena Marg

While an increasing number of (automatic) metrics is available to assess the linguistic quality of machine translations, their interpretation remains cryptic to many users, specifically in the translation community. They are clearly useful for indicating certain overarching trends, but say little about actual improvements for translation buyers or post-editors. However, these metrics are commonly referenced when discussing pricing and models, both with translation buyers and service providers. With the aim of focusing on automatic metrics that are easier to understand for non-research users, we identified Edit Distance (or Post-Edit Distance) as a good fit. While Edit Distance as such does not express cognitive effort or time spent editing machine translation suggestions, we found that it correlates strongly with the productivity tests we performed, for various language pairs and domains. This paper aims to analyse Edit Distance and productivity data on a segment level based on data gathered over some years. Drawing from these findings, we want to then explore how Edit Distance could help in predicting productivity on new content. Some further analysis is proposed, with findings to be presented at the conference.
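
Since the paper hinges on Edit Distance between raw MT output and its post-edited version, the following generic word-level Levenshtein implementation may help readers see what is being measured; it is a standard textbook sketch, not the authors' tooling.

```python
def edit_distance(hypothesis: str, post_edit: str) -> int:
    """Word-level Levenshtein distance between MT output and its post-edit."""
    hyp, ref = hypothesis.split(), post_edit.split()
    # dp[i][j] = edits needed to turn the first i hyp words into the first j ref words
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(hyp)][len(ref)]

if __name__ == "__main__":
    mt = "the contract be signed tomorrow"
    pe = "the contract will be signed tomorrow"
    print(edit_distance(mt, pe))  # -> 1 (one insertion)
```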

PE2rr Corpus: Manual Error Annotation of Automatically Pre-annotated MT Post-edits

Maja Popovic and Mihael Arcan

We present a freely available corpus containing source language texts from different domains along with their automatically generated translations into several distinct morphologically rich languages, their post-edited versions, and error annotations of the performed post-edit operations. We believe that the corpus will be useful for many different applications. The main advantage of the approach used for creation of the corpus is the fusion of post-editing and error classification tasks, which have usually been seen as two independent tasks, although naturally they are not. We also show benefits of coupling automatic and manual error classification which facilitates the complex manual error annotation task as well as the development of automatic error classification tools. In addition, the approach facilitates annotation of language pair related issues.

O2 - Sentiment Analysis and Emotion Recognition
Wednesday, May 25, 11:35

Chairperson: Núria Bel Oral Session

Sentiment Lexicons for Arabic Social Media

Saif Mohammad, Mohammad Salameh and Svetlana Kiritchenko

Existing Arabic sentiment lexicons have low coverage, with only a few thousand entries. In this paper, we present several large sentiment lexicons that were automatically generated using two different methods: (1) by using distant supervision techniques on Arabic tweets, and (2) by translating English sentiment lexicons into Arabic using a freely available statistical machine translation system. We compare the usefulness of new and old sentiment lexicons in the downstream application of sentence-level sentiment analysis. Our baseline sentiment analysis system uses numerous surface form features. Nonetheless, the system benefits from using additional features drawn from sentiment lexicons. The best result is obtained using the automatically generated Dialectal Hashtag Lexicon and the Arabic translations of the NRC Emotion Lexicon (accuracy of 66.6%). Finally, we describe a qualitative study of the automatic translations of English sentiment lexicons into Arabic, which shows that about 88% of the automatically translated entries are valid for Arabic as well. Close to 10% of the invalid entries are caused by gross mistranslations, close to 40% by translations into a related word, and about 50% by differences in how the word is used in Arabic.

A Language Independent Method for Generating Large Scale Polarity Lexicons

Giuseppe Castellucci, Danilo Croce and Roberto Basili

Sentiment Analysis systems aim at detecting opinions and sentiments that are expressed in texts. Many approaches in the literature are based on resources that model the prior polarity of words or multi-word expressions, i.e. a polarity lexicon. Such resources are defined by teams of annotators, i.e. a manual annotation is provided to associate emotional or sentiment facets with the lexicon entries. The development of such lexicons is an expensive and language dependent process, so that they often do not cover all the linguistic sentiment phenomena. Moreover, once a lexicon is defined it can hardly be adopted in a different language or even a different domain. In this paper, we present several Distributional Polarity Lexicons (DPLs), i.e. large-scale polarity lexicons acquired with an unsupervised methodology based on Distributional Models of Lexical Semantics. Given a set of heuristically annotated sentences from Twitter, we transfer the sentiment information from sentences to words. The approach is mostly unsupervised, and experimental evaluations on Sentiment Analysis tasks in two languages show the benefits of the generated resources. The generated DPLs are publicly available in English and Italian.

Sentiment Analysis in Social Networks through Topic Modeling

Debashis Naskar, Sidahmed Mokaddem, Miguel Rebollo and Eva Onaindia

In this paper, we analyze the sentiments derived from the conversations that occur in social networks. Our goal is to identify the sentiments of the users in the social network through their conversations. We conduct a study to determine whether users of social networks (Twitter in particular) tend to gather together according to the likeness of their sentiments. In our proposed framework, (1) we use ANEW, a lexical dictionary, to identify affective emotional feelings associated with a message according to Russell’s model of affect; (2) we design a topic modeling mechanism called Sent_LDA, based on the Latent Dirichlet Allocation (LDA) generative model, which allows us to find the topic distribution in a general conversation, and we associate topics with emotions; (3) we detect communities in the network according to the density and frequency of the messages among the users; and (4) we compare the sentiments of the communities by using Russell’s model of affect versus polarity, and we measure the extent to which topic distribution strengthens likeness in the sentiments of the users of a community. This work contributes a topic modeling methodology to analyze the sentiments in conversations that take place in social networks.
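
The pipeline above combines an affect lexicon, LDA topic modelling and community detection; the sketch below illustrates only the generic LDA building block with scikit-learn, on invented toy messages, and is not the authors' Sent_LDA model.

```python
# Generic LDA topic inference with scikit-learn; NOT the Sent_LDA model from
# the paper, just an illustration of the topic-modelling building block.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

messages = [                      # toy stand-ins for tweets
    "love this new phone battery lasts forever",
    "terrible battery phone died again",
    "great match tonight what a goal",
    "awful referee ruined the match",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(messages)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(doc_term)      # per-message topic distribution

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-3:][::-1]]
    print(f"topic {k}:", ", ".join(top_terms))
```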

A Comparison of Domain-based Word Polarity Estimation using different Word Embeddings

Aitor García Pablos, Montse Cuadros and German Rigau

A key point in Sentiment Analysis is to determine the polarity

of the sentiment implied by a certain word or expression. In

basic Sentiment Analysis systems this sentiment polarity of the

words is accounted and weighted in different ways to provide a

degree of positivity/negativity. Currently words are also modelled

as continuous dense vectors, known as word embeddings, which

seem to encode interesting semantic knowledge. With regard

to Sentiment Analysis, word embeddings are used as features

to more complex supervised classification systems to obtain

sentiment classifiers. In this paper we compare a set of existing

sentiment lexicons and sentiment lexicon generation techniques.

We also show a simple but effective technique to calculate a word

polarity value for each word in a domain using existing continuous

word embeddings generation methods. Further, we also show

that word embeddings calculated on an in-domain corpus capture the polarity better than those calculated on a general-domain corpus.
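
A common, simple way to derive such domain-specific polarity values from embeddings (possibly differing in detail from the technique used in the paper) is to compare each word with small positive and negative seed sets, as in this sketch; the embedding dictionary and seed lists are assumptions of the example.

    # Sketch: polarity of a word as its similarity to positive minus negative seeds.
    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def polarity(word, embeddings, pos_seeds=("good", "great"), neg_seeds=("bad", "poor")):
        # embeddings: hypothetical dict of word -> vector trained on an in-domain corpus.
        v = embeddings[word]
        pos = np.mean([cosine(v, embeddings[s]) for s in pos_seeds if s in embeddings])
        neg = np.mean([cosine(v, embeddings[s]) for s in neg_seeds if s in embeddings])
        return pos - neg   # > 0 leans positive, < 0 leans negative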

Could Speaker, Gender or Age Awareness be beneficial in Speech-based Emotion Recognition?

Maxim Sidorov, Alexander Schmitt, Eugene Semenkin and Wolfgang Minker

Emotion Recognition (ER) is an important part of dialogue

analysis which can be used in order to improve the quality of

Spoken Dialogue Systems (SDSs). The emotional hypothesis

of the current response of an end-user might be utilised by

the dialogue manager component in order to change the SDS

strategy which could result in a quality enhancement. In this

study additional speaker-related information is used to improve

the performance of the speech-based ER process. The analysed

information is the speaker identity, gender and age of a user. Two

schemes are described here, namely, using additional information

as an independent variable within the feature vector and creating

separate emotional models for each speaker, gender or age-cluster

independently. The performances of the proposed approaches

were compared against the baseline ER system, where no

additional information has been used, on a number of emotional

speech corpora of German, English, Japanese and Russian. The

study revealed that for some of the corpora the proposed approach

significantly outperforms the baseline methods with a relative

difference of up to 11.9%.
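
The first of the two schemes (speaker-related attributes added as extra variables in the feature vector) can be sketched as below; the feature layout, encoder and classifier are illustrative choices, not the exact experimental setup.

    # Sketch of scheme 1: append one-hot encoded speaker metadata to acoustic features.
    import numpy as np
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.svm import SVC

    def train_emotion_classifier(acoustic_features, genders, ages, emotions):
        # acoustic_features: (n_samples, n_dims) array, e.g. prosodic/spectral features.
        meta = np.column_stack([genders, ages])               # categorical speaker info
        meta_1hot = OneHotEncoder().fit_transform(meta).toarray()
        X = np.hstack([acoustic_features, meta_1hot])         # augmented feature vector
        return SVC().fit(X, emotions)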

O3 - Corpora for Language Analysis
Wednesday, May 25, 11:35

Chairperson: Stelios Piperidis (Oral Session)

Discriminative Analysis of Linguistic Features for Typological Study

Hiroya Takamura, Ryo Nagata and Yoshifumi Kawasaki

We address the task of automatically estimating the missing values

of linguistic features by making use of the fact that some linguistic

features in typological databases are informative to each other.

The questions to address in this work are (i) how much predictive

power do features have on the value of another feature? (ii) to

what extent can we attribute this predictive power to genealogical

or areal factors, as opposed to being provided by tendencies or

implicational universals? To address these questions, we conduct

a discriminative or predictive analysis on the typological database.

Specifically, we use a machine-learning classifier to estimate the

value of each feature of each language using the values of the other

features, under different choices of training data: all the other

languages, or all the other languages except for the ones having

the same origin or area as the target language.
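
A hedged sketch of this estimation setup is given below; the data layout (a list of language records with integer-encoded feature values, family and area labels) is an assumption made for the example, not the actual database interface used in the paper.

    # Sketch: predict one typological feature from the others, optionally excluding
    # genealogically or areally related languages from the training data.
    from sklearn.ensemble import RandomForestClassifier

    def predict_feature(languages, target, feat_name, other_feats, exclude_related=False):
        train = [l for l in languages
                 if l["name"] != target["name"] and feat_name in l["features"]]
        if exclude_related:   # second training condition described in the abstract
            train = [l for l in train
                     if l["family"] != target["family"] and l["area"] != target["area"]]
        X = [[l["features"].get(f, -1) for f in other_feats] for l in train]
        y = [l["features"][feat_name] for l in train]
        clf = RandomForestClassifier(random_state=0).fit(X, y)
        return clf.predict([[target["features"].get(f, -1) for f in other_feats]])[0]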

POS-tagging of Historical Dutch

Dieuwke Hupkes and Rens Bod

We present a study of the adequacy of current methods that

are used for POS-tagging historical Dutch texts, as well as an

exploration of the influence of employing different techniques to

improve upon the current practice. The main focus of this paper is

on (unsupervised) methods that are easily adaptable for different

domains without requiring extensive manual input. It was found

that modernising the spelling of corpora prior to tagging them with

a tagger trained on contemporary Dutch results in a large increase

in accuracy, but that spelling normalisation alone is not sufficient

to obtain state-of-the-art results. The best results were achieved

by training a POS-tagger on a corpus automatically annotated by

projecting (automatically assigned) POS-tags via word alignments

from a contemporary corpus. This result is promising, as it

was reached without including any domain knowledge or context

dependencies. We argue that the insights of this study combined

with semi-supervised learning techniques for domain adaptation

can be used to develop a general-purpose diachronic tagger for

Dutch.

A Language Resource of German Errors Written by Children with Dyslexia

Maria Rauschenberger, Luz Rello, Silke Füchsel and Jörg Thomaschewski

In this paper we present a language resource for German,

composed of a list of 1,021 unique errors extracted from a

collection of texts written by people with dyslexia. The errors

were annotated with a set of linguistic characteristics as well as

visual and phonetic features. We present the compilation and the

annotation criteria for the different types of dyslexic errors. This

language resource has many potential uses since errors written by

people with dyslexia reflect their difficulties. For instance, it has

already been used to design language exercises to treat dyslexia in

German. To the best of our knowledge, this is the first resource of this kind in German.

CItA: an L1 Italian Learners Corpus to Study the Development of Writing Competence

Alessia Barbagli, Pietro Lucisano, Felice Dell'Orletta, Simonetta Montemagni and Giulia Venturi

In this paper, we present the CItA corpus (Corpus Italiano di

Apprendenti L1), a collection of essays written by Italian L1

learners collected during the first and second year of lower

secondary school. The corpus was built in the framework of

an interdisciplinary study jointly carried out by computational

linguists and experimental pedagogists, aimed at tracking the development of written language competence over the years in relation to students' background information.

If You Even Don’t Have a Bit of Bible: LearningDelexicalized POS Taggers

Zhiwei Yu, David Marecek, Zdenek Žabokrtský and Daniel Zeman

Part-of-speech (POS) induction is one of the most popular

tasks in research on unsupervised NLP. Various unsupervised

and semi-supervised methods have been proposed to tag an

unseen language. However, many of them require some partial

understanding of the target language because they rely on

dictionaries or parallel corpora such as the Bible. In this paper,

we propose a different method named delexicalized tagging, for

which we only need a raw corpus of the target language. We

transfer tagging models trained on annotated corpora of one or

more resource-rich languages. We employ language-independent

features such as word length, frequency, neighborhood entropy,

character classes (alphabetic vs. numeric vs. punctuation) etc.

We demonstrate that such features can, to a certain extent, serve as

predictors of the part of speech, represented by the universal POS

tag.
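
The kind of language-independent, delexicalized features listed above can be computed from a raw corpus alone, roughly as in the following sketch (the exact feature set and encoding used in the paper may differ).

    # Sketch: delexicalized features for one word type from a raw corpus.
    import math
    from collections import Counter

    def char_class(word):
        if word.isalpha():
            return 0
        if word.isnumeric():
            return 1
        return 2   # punctuation or mixed

    def neighbour_entropy(neighbours):
        # neighbours: Counter of words observed immediately next to the target word.
        total = sum(neighbours.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total) for c in neighbours.values())

    def delexicalized_features(word, freq, right_neighbours):
        return [len(word), math.log(freq + 1), char_class(word),
                neighbour_entropy(right_neighbours)]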

O4 - Spoken Corpus Dialogue
Wednesday, May 25, 11:35

Chairperson: Asuncion Moreno (Oral Session)

The SpeDial datasets: datasets for Spoken Dialogue Systems analytics

José Lopes, Arodami Chorianopoulou, Elisavet Palogiannidi, Helena Moniz, Alberto Abad, Katerina Louka, Elias Iosif and Alexandros Potamianos

The SpeDial consortium is sharing two datasets that were used

during the SpeDial project. By sharing them with the community

we are providing a resource to reduce the duration of the development cycle of new Spoken Dialogue Systems (SDSs). The

datasets include audio recordings and several manual annotations, i.e.,

miscommunication, anger, satisfaction, repetition, gender and

task success. The datasets were created with data from real

users and cover two different languages: English and Greek.

Detectors for miscommunication, anger and gender were trained

for both systems. The detectors were particularly accurate

in tasks where humans have high annotator agreement such

as miscommunication and gender. As expected due to the

subjectivity of the task, the anger detector had a less satisfactory

performance. Nevertheless, we proved that the automatic

detection of situations that can lead to problems in SDSs is

possible and can be a promising direction to reduce the duration

of SDS’s development cycle.

Creating Annotated Dialogue Resources: Cross-domain Dialogue Act Classification

Dilafruz Amanova, Volha Petukhova and Dietrich Klakow

This paper describes a method to automatically create dialogue

resources annotated with dialogue act information by reusing

existing dialogue corpora. Numerous dialogue corpora are

available for research purposes and many of them are annotated

with dialogue act information that captures the intentions encoded

in user utterances. Annotated dialogue resources, however, differ

in various respects: data collection settings and modalities used,

dialogue task domains and scenarios (if any) underlying the

collection, number and roles of dialogue participants involved

and dialogue act annotation schemes applied. The presented

study encompasses three phases of data-driven investigation.

We, first, assess the importance of various types of features

and their combinations for effective cross-domain dialogue act

classification. Second, we establish the best predictive model

comparing various cross-corpora training settings. Finally, we

specify model adaptation procedures and explore late fusion approaches to optimize the overall classification decision-making process. The proposed methodology accounts for empirically

motivated and technically sound classification procedures that

may reduce annotation and training costs significantly.

Towards a Multi-dimensional Taxonomy of Stories in Dialogue

Kathryn J. Collins and David Traum

In this paper, we present a taxonomy of stories told in dialogue.

We based our scheme on prior work analyzing narrative structure

and method of telling, relation to storyteller identity, as well

as some categories particular to dialogue, such as how the

story gets introduced. Our taxonomy currently has 5 major

dimensions, with most having sub-dimensions - each dimension

has an associated set of dimension-specific labels. We adapted

an annotation tool for this taxonomy and have annotated portions

of two different dialogue corpora, Switchboard and the Distress

Analysis Interview Corpus. We present examples of some of the

tags and concepts with stories from Switchboard, and some initial

statistics of frequencies of the tags.

PentoRef: A Corpus of Spoken References in Task-oriented Dialogues

Sina Zarrieß, Julian Hough, Casey Kennington, Ramesh Manuvinakurike, David DeVault, Raquel Fernandez and David Schlangen

PentoRef is a corpus of task-oriented dialogues collected in

systematically manipulated settings. The corpus is multilingual,

with English and German sections, and overall comprises more

than 20,000 utterances. The dialogues are fully transcribed

and annotated with referring expressions mapped to objects

in corresponding visual scenes, which makes the corpus a

rich resource for research on spoken referring expressions in

generation and resolution. The corpus includes several sub-

corpora that correspond to different dialogue situations where

parameters related to interactivity, visual access, and verbal

channel have been manipulated in systematic ways. The

corpus thus lends itself to very targeted studies of reference in

spontaneous dialogue.

Transfer of Corpus-Specific Dialogue Act Annotation to ISO Standard: Is it worth it?

Shammur Absar Chowdhury, Evgeny Stepanov and Giuseppe Riccardi

Spoken conversation corpora often adapt existing Dialogue Act

(DA) annotation specifications, such as DAMSL, DIT++, etc.,

to task specific needs, yielding incompatible annotations; thus,

limiting corpora re-usability. The recently accepted ISO standard for DA annotation – Dialogue Act Markup Language (DiAML) – is designed to be domain and application independent. Moreover,

the clear separation of dialogue dimensions and communicative

functions, coupled with the hierarchical organization of the

latter, allows for classification at different levels of granularity.

However, re-annotating existing corpora with the new scheme

might require significant effort. In this paper we test the utility of

the ISO standard through comparative evaluation of the corpus-

specific legacy and the semi-automatically transferred DiAML

DA annotations on a supervised dialogue act classification task. To

test the domain independence of the resulting annotations, we

perform cross-domain and data aggregation evaluation. Compared

to the legacy annotation scheme, on the Italian LUNA Human-

Human corpus, the DiAML annotation scheme exhibits better

cross-domain and data aggregation classification performance,

while maintaining comparable in-domain performance.

P01 - Anaphora and Coreference
Wednesday, May 25, 11:35

Chairperson: Steve Cassidy (Poster Session)

WikiCoref: An English Coreference-annotated Corpus of Wikipedia Articles

Abbas Ghaddar and Phillippe Langlais

This paper presents WikiCoref, an English corpus annotated for

anaphoric relations, where all documents are from the English

version of Wikipedia. Our annotation scheme follows the one of

OntoNotes with a few disparities. We annotated each markable

with coreference type, mention type and the equivalent Freebase

topic. Since most similar annotation efforts concentrate on

very specific types of written text, mainly newswire, there is a

lack of resources for otherwise over-used Wikipedia texts. The

corpus described in this paper addresses this issue. We present

a freely available resource we initially devised for improving

coreference resolution algorithms dedicated to Wikipedia texts.

Our corpus has no restriction on the topics of the documents being

annotated, and documents of various sizes have been considered

for annotation.

Exploitation of Co-reference in Distributional Semantics

Dominik Schlechtweg

The aim of distributional semantics is to model the similarity

of the meaning of words via the words they occur with.

Thereby, it relies on the distributional hypothesis implying

that similar words have similar contexts. Deducing meaning

from the distribution of words is interesting as it can be done

automatically on large amounts of freely available raw text.

It is because of this convenience that most current state-of-

the-art models of distributional semantics operate on raw text,

although there have been successful attempts to integrate other

kinds of—e.g., syntactic—information to improve distributional

semantic models. In contrast, less attention has been paid to

semantic information in the research community. One reason

for this is that the extraction of semantic information from raw

text is a complex, elaborate matter and in great parts not yet

satisfyingly solved. Recently, however, there have been successful

attempts to integrate a certain kind of semantic information,

i.e., co-reference. Two basically different kinds of information

contributed by co-reference with respect to the distribution of

words will be identified. We will then focus on one of these and

examine its general potential to improve distributional semantic

models as well as certain more specific hypotheses.

Adapting an Entity Centric Model for Portuguese Coreference Resolution

Evandro Fonseca, Renata Vieira and Aline Vanin

This paper presents the adaptation of an Entity Centric Model

for Portuguese coreference resolution, considering 10 named

entity categories. The model was evaluated on named entities using the HAREM Portuguese corpus, and the results are 81.0% precision and 58.3% recall overall. The resulting system is freely available.

IMS HotCoref DE: A Data-driven Co-reference Resolver for German

Ina Roesiger and Jonas Kuhn

This paper presents a data-driven co-reference resolution system

for German that has been adapted from IMS HotCoref, a co-

reference resolver for English. It describes the difficulties when

resolving co-reference in German text, the adaptation process and

the features designed to address linguistic challenges brought forth

by German. We report performance on the reference dataset

TüBa-D/Z and include a post-task SemEval 2010 evaluation,

showing that the resolver achieves state-of-the-art performance.

We also include ablation experiments that indicate that integrating

linguistic features improves results. The paper also describes the

steps and the format necessary to use the resolver on new texts.

The tool is freely available for download.

Coreference Annotation Scheme and Relation Types for Hindi

Vandan Mujadia, Palash Gupta and Dipti Misra Sharma

This paper describes a coreference annotation scheme,

coreference annotation specific issues and their solutions through

our proposed annotation scheme for Hindi. We introduce different

co-reference relation types between continuous mentions of the

same coreference chain such as “Part-of”, “Function-value pair”

etc. We used Jaccard similarity based Krippendorff’s alpha to

demonstrate consistency in annotation scheme, annotation and

corpora. To ease the coreference annotation process, we built

a semi-automatic Coreference Annotation Tool (CAT). We also

provide statistics of coreference annotation on Hindi Dependency

Treebank (HDTB).
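
For reference, Krippendorff's alpha is commonly computed as \alpha = 1 - D_o / D_e, the ratio of observed to expected disagreement; using Jaccard similarity, as mentioned above, presumably amounts to measuring the disagreement between two annotated mention sets A and B with the Jaccard distance

    \delta(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}

(the exact formulation used by the authors is given in the paper).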

Coreference in Prague Czech-English Dependency Treebank

Anna Nedoluzhko, Michal Novák, Silvie Cinkova, Marie Mikulová and Jirí Mírovský

We present coreference annotation on parallel Czech-English texts

of the Prague Czech-English Dependency Treebank (PCEDT).

The paper describes innovations made to PCEDT 2.0 concerning

coreference, as well as coreference information already present

there. We characterize the coreference annotation scheme, give

the statistics and compare our annotation with the coreference

annotation in OntoNotes and the Prague Dependency Treebank for

Czech. We also present the experiments made using this corpus to

improve the alignment of coreferential expressions, which helps

us to collect better statistics of correspondences between types of

coreferential relations in Czech and English. The corpus released

as PCEDT 2.0 Coref is publicly available.

Sieve-based Coreference Resolution in the Biomedical Domain

Dane Bell, Gus Hahn-Powell, Marco A. Valenzuela-Escárcega and Mihai Surdeanu

We describe challenges and advantages unique to coreference

resolution in the biomedical domain, and a sieve-based

architecture that leverages domain knowledge for both entity

and event coreference resolution. Domain-general coreference

resolution algorithms perform poorly on biomedical documents,

because the cues they rely on such as gender are largely absent

in this domain, and because they do not encode domain-specific

knowledge such as the number and type of participants required

in chemical reactions. Moreover, it is difficult to directly encode

this knowledge into most coreference resolution algorithms

because they are not rule-based. Our rule-based architecture

uses sequentially applied hand-designed “sieves”, with the output

of each sieve informing and constraining subsequent sieves.

This architecture provides a 3.2% increase in throughput to our

Reach event extraction system with precision parallel to that of

the stricter system that relies solely on syntactic patterns for

extraction.

Annotating Characters in Literary Corpora: A Scheme, the CHARLES Tool, and an Annotated Novel

Hardik Vala, Stefan Dimitrov, David Jurgens, Andrew Piper and Derek Ruths

Characters form the focus of various studies of literary works,

including social network analysis, archetype induction, and plot

comparison. The recent rise in the computational modelling of

literary works has produced a proportional rise in the demand

for character-annotated literary corpora. However, automatically

identifying characters is an open problem and there is low

availability of literary texts with manually labelled characters. To

address the latter problem, this work presents three contributions:

(1) a comprehensive scheme for manually resolving mentions

to characters in texts. (2) A novel collaborative annotation

tool, CHARLES (CHAracter Resolution Label-Entry System) for

character annotation and similar cross-document tagging tasks.

(3) The character annotations resulting from a pilot study on the

novel Pride and Prejudice, demonstrating that the scheme and tool

facilitate the efficient production of high-quality annotations. We

expect this work to motivate the further production of annotated

literary corpora to help meet the demand of the community.

P02 - Computer Aided Language Learning
Wednesday, May 25, 11:35

Chairperson: Stephanie Strassel (Poster Session)

Error Typology and Remediation Strategies for Requirements Written in English by Non-Native Speakers

Marie Garnier and Patrick Saint-Dizier

In most international industries, English is the main language of

communication for technical documents. These documents are

designed to be as unambiguous as possible for their users. For

international industries based in non-English speaking countries,

the professionals in charge of writing requirements are often non-

native speakers of English, who rarely receive adequate training

in the use of English for this task. As a result, requirements

can contain a relatively large diversity of lexical and grammatical

errors, which are not eliminated by the use of guidelines from

controlled languages. This article investigates the distribution

of errors in a corpus of requirements written in English by

native speakers of French. Errors are defined on the basis of

grammaticality and acceptability principles, and classified using

comparable categories. Results show a high proportion of errors

in the Noun Phrase, notably through modifier stacking, and

errors consistent with simplification strategies. Comparisons

with similar corpora in other genres reveal the specificity of

the distribution of errors in requirements. This research also

introduces possible applied uses, in the form of strategies for the

automatic detection of errors, and in-person training provided by

certification boards in requirements authoring.

Improving POS Tagging of German Learner Language in a Reading Comprehension Scenario

Lena Keiper, Andrea Horbach and Stefan Thater

We present a novel method to automatically improve the accuracy

of part-of-speech taggers on learner language. The key idea

underlying our approach is to exploit the structure of a typical

language learner task and automatically induce POS information

for out-of-vocabulary (OOV) words. To evaluate the effectiveness

of our approach, we add manual POS and normalization

information to an existing language learner corpus. Our evaluation

shows an increase in accuracy from 72.4% to 81.5% on OOV

words.

SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies

Elena Volodina, Ildikó Pilán, Ingegerd Enström, Lorena Llozhi, Peter Lundkvist, Gunlög Sundberg and Monica Sandell

We present a new resource for Swedish, SweLL, a corpus of

Swedish Learner essays linked to learners’ performance according

to the Common European Framework of Reference (CEFR).

SweLL consists of three subcorpora – SpIn, SW1203 and Tisus,

collected from three different educational establishments. The

common metadata for all subcorpora includes age, gender, native

languages, time of residence in Sweden, and type of written task.

Depending on the subcorpus, learner texts may contain additional

information, such as text genres, topics, grades. Five of

the six CEFR levels are represented in the corpus: A1, A2,

B1, B2 and C1, comprising 339 essays in total. The C2 level is not included since courses at C2 level are not offered. The

work flow consists of collection of essays and permits, essay

digitization and registration, meta-data annotation, automatic

linguistic annotation. Inter-rater agreement is presented on the

basis of SW1203 subcorpus. The work on SweLL is still ongoing

with more than 100 essays waiting in the pipeline. This article both

describes the resource and the “how-to” behind the compilation of

SweLL.

SVALex: a CEFR-graded Lexical Resource for Swedish Foreign and Second Language Learners

Thomas Francois, Elena Volodina, Ildikó Pilán and Anaïs Tack

The paper introduces SVALex, a lexical resource primarily

aimed at learners and teachers of Swedish as a foreign and

second language that describes the distribution of 15,681 words

and expressions across the Common European Framework of

Reference (CEFR). The resource is based on a corpus of

coursebook texts, and thus describes receptive vocabulary learners

are exposed to during reading activities, as opposed to productive

vocabulary they use when speaking or writing. The paper

describes the methodology applied to create the list and to estimate

the frequency distribution. It also discusses some characteristics

of the resulting resource and compares it to other lexical resources

for Swedish. An interesting feature of this resource is the

possibility to separate the wheat from the chaff, distinguishing the core vocabulary at each level, i.e. vocabulary shared by several coursebook writers at that level, from peripheral vocabulary which is used by only a minority of the coursebook writers.

Detecting Word Usage Errors in Chinese Sentences for Learning Chinese as a Foreign Language

Yow-Ting Shiue and Hsin-Hsi Chen

Automated grammatical error detection, which helps users

improve their writing, is an important application in NLP.

Recently more and more people are learning Chinese, and an

automated error detection system can be helpful for the learners.

This paper proposes n-gram features, dependency count features,

dependency bigram features, and single-character features to

determine if a Chinese sentence contains word usage errors, in

which a word is written as a wrong form or the word selection

is inappropriate. By marking potential errors at the level of sentence segments, typically delimited by punctuation marks, the learner can try to correct the problems without the assistance of a

language teacher. Experiments on the HSK corpus show that the

classifier combining all sets of features achieves an accuracy of

0.8423. By utilizing certain combinations of the sets of features,

we can construct a system that favors precision or recall. The

best precision we achieve is 0.9536, indicating that our system is

reliable and seldom produces misleading results.

LibN3L: A Lightweight Package for Neural NLP

Meishan Zhang, Jie Yang, Zhiyang Teng and Yue Zhang

We present a light-weight machine learning tool for NLP research.

The package supports operations on both discrete and dense

vectors, facilitating implementation of linear models as well as

neural models. It provides several basic layers which mainly

aims for single-layer linear and non-linear transformations. By

using these layers, we can conveniently implement linear models

and simple neural models. Besides, this package also integrates

several complex layers by composing those basic layers, such as

RNN, Attention Pooling, LSTM and gated RNN. Those complex

layers can be used to implement deep neural models directly.

Evaluating Lexical Simplification and Vocabulary Knowledge for Learners of French: Possibilities of Using the FLELex Resource

Anaïs Tack, Thomas Francois, Anne-Laure Ligozat and Cédrick Fairon

This study examines two possibilities of using the FLELex graded

lexicon for the automated assessment of text complexity in French

as a foreign language. From the lexical frequency

distributions described in FLELex, we derive a single level of

difficulty for each word in a parallel corpus of original and

simplified texts. We then use this data to automatically assess

the lexical complexity of texts in two ways. On the one hand, we

evaluate the degree of lexical simplification in manually simplified

texts with respect to their original version. Our results show

a significant simplification effect, both in the case of French

narratives simplified for non-native readers and in the case of

simplified Wikipedia texts. On the other hand, we define a

predictive model which identifies the number of words in a text

that are expected to be known at a particular learning level.

We assess the accuracy with which these predictions are able to

capture actual word knowledge as reported by Dutch-speaking

learners of French. Our study shows that although the predictions

seem relatively accurate in general (87.4% to 92.3%), they do not

yet seem to cover the learners’ lack of knowledge very well.

A Shared Task for Spoken CALL?

Claudia Baur, Johanna Gerlach, Manny Rayner, Martin Russell and Helmer Strik

We argue that the field of spoken CALL needs a shared task

in order to facilitate comparisons between different groups and

methodologies, and describe a concrete example of such a task,

based on data collected from a speech-enabled online tool which

has been used to help young Swiss German teens practise skills in

English conversation. Items are prompt-response pairs, where the

prompt is a piece of German text and the response is a recorded

English audio file. The task is to label pairs as “accept” or “reject”,

accepting responses which are grammatically and linguistically

correct to match a set of hidden gold standard answers as closely

as possible. Initial resources are provided so that a scratch system

can be constructed with a minimal investment of effort, and in

particular without necessarily using a speech recogniser. Training

data for the task will be released in June 2016, and test data in

January 2017.

Joining-in-type Humanoid Robot Assisted Language Learning System

AlBara Khalifa, Tsuneo Kato and Seiichi Yamamoto

Dialogue robots are attractive to people, and in language

learning systems, they motivate learners and let them practice

conversational skills in more realistic environment. However,

automatic speech recognition (ASR) of the second language (L2)

learners is still a challenge, because their speech contains not just

pronunciation, lexical and grammatical errors, but is sometimes totally

disordered. Hence, we propose a novel robot assisted language

learning (RALL) system using two robots, one as a teacher and

the other as an advanced learner. The system is designed to

simulate multiparty conversation, expecting implicit learning and

enhancement of predictability of learners’ utterance through an

alignment similar to “interactive alignment”, which is observed

in human-human conversation. We collected a database with the

prototypes, and measured in an initial analysis how much the alignment phenomenon is observed in the database.

P03 - Evaluation Methodologies (1)
Wednesday, May 25, 11:35

Chairperson: Ann Bies (Poster Session)

OSMAN – A Novel Arabic Readability Metric

Mahmoud El-Haj and Paul Rayson

We present OSMAN (Open Source Metric for Measuring Arabic

Narratives) - a novel open source Arabic readability metric and

tool. It allows researchers to calculate readability for Arabic text

with and without diacritics. OSMAN is a modified version of

the conventional readability formulas such as Flesch and Fog.

In our work we introduce a novel approach towards counting

short, long and stress syllables in Arabic which is essential for

judging readability of Arabic narratives. We also introduce an

additional factor called “Faseeh” which considers aspects of script

usually dropped in informal Arabic writing. To evaluate our

methods we used Spearman’s correlation metric to compare text

readability for 73,000 parallel sentences from English and Arabic

UN documents. The Arabic sentences were written with the

absence of diacritics and in order to count the number of syllables

we added the diacritics back in using an open source tool called

Mishkal. The results show that OSMAN readability formula

correlates well with the English ones making it a useful tool for

researchers and educators working with Arabic text.
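
As context for readers unfamiliar with such metrics: the classic Flesch Reading Ease formula that OSMAN builds on is

    \mathrm{FRE} = 206.835 - 1.015 \cdot \frac{\#\mathrm{words}}{\#\mathrm{sentences}} - 84.6 \cdot \frac{\#\mathrm{syllables}}{\#\mathrm{words}}

and OSMAN modifies this kind of formula with Arabic-specific syllable counting and the "Faseeh" factor; the exact coefficients are given in the paper, not here.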

Evaluating Interactive System Adaptation

Edouard Geoffrois

Enabling users of intelligent systems to enhance the system

performance by providing feedback on their errors is an important

need. However, the ability of systems to learn from user feedback

is difficult to evaluate in an objective and comparative way.

Indeed, the involvement of real users in the adaptation process

is an impediment to objective evaluation. This issue can be

solved by using an oracle approach, where users are simulated

by oracles having access to the reference test data. Another

difficulty is to find a meaningful metric despite the fact that system

improvements depend on the feedback provided and on the system

itself. A solution is to measure the minimal amount of information

needed to correct all system errors. It can be shown that for

any well-defined non-interactive task, the interactively supervised

version of the task can be evaluated by combining such an oracle-

based approach and a minimum supervision rate metric. This new

evaluation protocol for adaptive systems is not only expected to

drive progress for such systems, but also to pave the way for a

specialisation of actors along the value chain of their technological

development.

Complementarity, F-score, and NLP Evaluation

Leon Derczynski

This paper addresses the problem of quantifying the differences

between entity extraction systems, where in general only a small

proportion of a document should be selected. Comparing overall

accuracy is not very useful in these cases, as small differences

in accuracy may correspond to huge differences in selections

over the target minority class. Conventionally, one may use per-

token complementarity to describe these differences, but it is not

very useful when the set is heavily skewed. In such situations,

which are common in information retrieval and entity recognition,

metrics like precision and recall are typically used to describe

performance. However, precision and recall fail to describe the

differences between sets of objects selected by different decision

strategies, instead just describing the proportional amount of

correct and incorrect objects selected. This paper presents a

method for measuring complementarity for precision, recall and

F-score, quantifying the difference between entity extraction

approaches.
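
For reference, per-token complementarity between two systems A and B is usually defined (following Brill and Wu, 1998) as

    \mathrm{Comp}(A, B) = 1 - \frac{|\mathrm{errors}(A) \cap \mathrm{errors}(B)|}{|\mathrm{errors}(A)|}

i.e. the proportion of A's errors that B handles correctly; the paper extends this idea from token-level accuracy to precision, recall and F-score over the selected entity sets (the precise definitions are given in the paper itself).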

DRANZIERA: An Evaluation Protocol For Multi-Domain Opinion Mining

Mauro Dragoni, Andrea Tettamanzi and Célia da Costa Pereira

Opinion Mining is a topic which has attracted a lot of interest in recent years. When surveying the literature, it is often hard to replicate

system evaluation due to the unavailability of the data used for

the evaluation or to the lack of details about the protocol used in

the campaign. In this paper, we propose an evaluation protocol,

called DRANZIERA, composed of a multi-domain dataset and

guidelines allowing both to evaluate opinion mining systems in

different contexts (Closed, Semi-Open, and Open) and to compare

them to each other and to a number of baselines.

Evaluating a Topic Modelling Approach to Measuring Corpus Similarity

Richard Fothergill, Paul Cook and Timothy Baldwin

Web corpora are often constructed automatically, and their

contents are therefore often not well understood. One technique

for assessing the composition of such a web corpus is to

empirically measure its similarity to a reference corpus whose

composition is known. In this paper we evaluate a number of

measures of corpus similarity, including a method based on topic

modelling which has not been previously evaluated for this task.

To evaluate these methods we use known-similarity corpora that

have been previously used for this purpose, as well as a number of

newly-constructed known-similarity corpora targeting differences

in genre, topic, time, and region. Our findings indicate that,

overall, the topic modelling approach did not improve on a chi-

square method that had previously been found to work well for

measuring corpus similarity.
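
The chi-square method referred to above (in the spirit of Kilgarriff's known-similarity-corpora work) can be sketched as follows; the word-list size and tokenisation are illustrative choices, not the paper's exact configuration.

    # Sketch: chi-square corpus similarity over the most frequent shared words.
    from collections import Counter

    def chi_square_similarity(corpus_a, corpus_b, n_words=500):
        # corpus_a, corpus_b: lists of tokens.
        fa, fb = Counter(corpus_a), Counter(corpus_b)
        na, nb = sum(fa.values()), sum(fb.values())
        common = [w for w, _ in (fa + fb).most_common(n_words)]
        chi2 = 0.0
        for w in common:
            oa, ob = fa[w], fb[w]                    # observed counts
            ea = (oa + ob) * na / (na + nb)          # expected under one shared distribution
            eb = (oa + ob) * nb / (na + nb)
            chi2 += (oa - ea) ** 2 / ea + (ob - eb) ** 2 / eb
        return chi2    # lower values indicate more similar corpora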

User, who art thou? User Profiling for Oral Corpus Platforms

Christian Fandrych, Elena Frick, Hanna Hedeland, Anna Iliash, Daniel Jettka, Cordula Meißner, Thomas Schmidt, Franziska Wallner, Kathrin Weigert and Swantje Westpfahl

This contribution presents the background, design and results of a

study of users of three oral corpus platforms in Germany. Roughly

5,000 registered users of the Database for Spoken German (DGD),

the GeWiss corpus and the corpora of the Hamburg Centre

for Language Corpora (HZSK) were asked to participate in a

user survey. This quantitative approach was complemented by

qualitative interviews with selected users. We briefly introduce

the corpus resources involved in the study in section 2. Section

3 describes the methods employed in the user studies. Section

4 summarizes results of the studies focusing on selected key

topics. Section 5 attempts a generalization of these results to larger

contexts.

Building a Corpus of Errors and Quality in Machine Translation: Experiments on Error Impact

Angela Costa, Rui Correia and Luisa Coheur

In this paper we describe a corpus of automatic translations

annotated with both error type and quality. The 300 sentences

that we have selected were generated by Google Translate,

Systran and two in-house Machine Translation systems that

use Moses technology. The errors present on the translations

were annotated with an error taxonomy that divides errors in

five main linguistic categories (Orthography, Lexis, Grammar,

Semantics and Discourse), reflecting the language level where

the error is located. After the error annotation process, we

assessed the translation quality of each sentence using a four

point comprehension scale from 1 to 5. Both tasks of error and

quality annotation were performed by two different annotators,

achieving good levels of inter-annotator agreement. The creation

of this corpus allowed us to use it as training data for a translation

quality classifier. We concluded on error severity by observing the

outputs of two machine learning classifiers: a decision tree and a

regression model.

Evaluating the Readability of Text Simplification Output for Readers with Cognitive Disabilities

Victoria Yaneva, Irina Temnikova and Ruslan Mitkov

This paper presents an approach for automatic evaluation of the

readability of text simplification output for readers with cognitive

disabilities. First, we present our work towards the development

of the EasyRead corpus, which contains easy-to-read documents

created especially for people with cognitive disabilities. We

then compare the EasyRead corpus to the simplified output

contained in the LocalNews corpus (Feng, 2009), the accessibility

of which has been evaluated through reading comprehension

experiments including 20 adults with mild intellectual disability.

This comparison is made on the basis of 13 disability-specific

linguistic features. The comparison reveals that there are no

major differences between the two corpora, which shows that

the EasyRead corpus is at a similar reading level to the user-

evaluated texts. We also discuss the role of Simple Wikipedia

(Zhu et al., 2010) as a widely-used accessibility benchmark, in

light of our finding that it is significantly more complex than both

the EasyRead and the LocalNews corpora.

Word Embedding Evaluation and Combination

Sahar Ghannay, Benoit Favre, Yannick Estève and Nathalie Camelin

Word embeddings have been successfully used in several natural language processing (NLP) and speech processing tasks. Different

approaches have been introduced to calculate word embeddings

through neural networks. In the literature, many studies focused

on word embedding evaluation, but to our knowledge, there

are still some gaps. This paper presents a study focusing on a

rigorous comparison of the performances of different kinds of

word embeddings. These performances are evaluated on different

NLP and linguistic tasks, while all the word embeddings are

estimated on the same training data using the same vocabulary,

the same number of dimensions, and other similar characteristics.

The evaluation results reported in this paper match those in the

literature, since they point out that the improvements achieved

by a word embedding in one task are not consistently observed

across all tasks. For that reason, this paper investigates and

evaluates approaches to combine word embeddings in order to

take advantage of their complementarity, and to look for the

effective word embeddings that can achieve good performances on

all tasks. In conclusion, this paper provides new insights into the intrinsic qualities of well-known word embedding families, which

can be different from the ones provided by works previously

published in the scientific literature.
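
Two generic combination strategies of the kind investigated here are simple concatenation and concatenation followed by dimensionality reduction; the sketch below illustrates both under the assumption that the embeddings are available as plain word-to-vector dictionaries (the paper's exact combination methods may differ).

    # Sketch: combine several pre-trained embeddings by concatenation (+ optional PCA).
    import numpy as np
    from sklearn.decomposition import PCA

    def combine(embedding_dicts, reduce_to=None):
        shared = set.intersection(*(set(d) for d in embedding_dicts))
        words = sorted(shared)
        concat = np.vstack([np.concatenate([d[w] for d in embedding_dicts]) for w in words])
        if reduce_to is not None:
            concat = PCA(n_components=reduce_to).fit_transform(concat)
        return dict(zip(words, concat))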

Benchmarking multimedia technologies with the CAMOMILE platform: the case of Multimodal Person Discovery at MediaEval 2015

Johann Poignant, Hervé Bredin, Claude Barras, Mickael Stefas, Pierrick Bruneau and Thomas Tamisier

In this paper, we claim that the CAMOMILE collaborative

annotation platform (developed in the framework of the

eponymous CHIST-ERA project) eases the organization of

multimedia technology benchmarks, automating most of the

campaign technical workflow and enabling collaborative (hence

faster and cheaper) annotation of the evaluation data. This

is demonstrated through the successful organization of a

new multimedia task at MediaEval 2015, Multimodal Person

Discovery in Broadcast TV.

Evaluating the Impact of Light Post-Editing on Usability

Sheila Castilho and Sharon O'Brien

This paper discusses a methodology to measure the usability of

machine translated content by end users, comparing lightly post-

edited content with raw output and with the usability of source

language content. The content selected consists of Online Help

articles from a software company for a spreadsheet application,

translated from English into German. Three groups of five users

each used either the source text (the English version, EN), the raw MT version (DE_MT), or the light PE version (DE_PE), and

were asked to carry out six tasks. Usability was measured using

an eye tracker and cognitive, temporal and pragmatic measures of

usability. Satisfaction was measured via a post-task questionnaire

presented after the participants had completed the tasks.

P04 - Information Extraction and Retrieval (1)
Wednesday, May 25, 11:35

Chairperson: Diana Maynard (Poster Session)

Operational Assessment of Keyword Search on Oral History

Elizabeth Salesky, Jessica Ray and Wade Shen

This project assesses the resources necessary to make oral

history searchable by means of automatic speech recognition

(ASR). There are many inherent challenges in applying ASR

to conversational speech: smaller training set sizes and varying

demographics, among others. We assess the impact of dataset

size, word error rate and term-weighted value on human search

capability through an information retrieval task on Mechanical

Turk. We use English oral history data collected by StoryCorps, a

national organization that provides all people with the opportunity

to record, share and preserve their stories, and control for a

variety of demographics including age, gender, birthplace, and

dialect on four different training set sizes. We show comparable

search performance using a standard speech recognition system

as with hand-transcribed data, which is promising for increased

accessibility of conversational speech and oral history archives.
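
The term-weighted value mentioned above is the standard NIST keyword-search measure; as usually defined, for a detection threshold \theta it is

    \mathrm{TWV}(\theta) = 1 - \big( P_{\mathrm{miss}}(\theta) + \beta \, P_{\mathrm{FA}}(\theta) \big)

averaged over query terms, where P_miss and P_FA are the miss and false-alarm probabilities and \beta is a cost weighting constant (typically 999.9 in the NIST keyword-search evaluations); the exact variant used in this study is described in the paper.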

Odin’s Runes: A Rule Language for InformationExtraction

Marco A. Valenzuela-Escárcega, Gus Hahn-Powell and Mihai Surdeanu

Odin is an information extraction framework that applies cascades

of finite state automata over both surface text and syntactic

dependency graphs. Support for syntactic patterns allows us to

concisely define relations that are otherwise difficult to express

in languages such as Common Pattern Specification Language

(CPSL), which are currently limited to shallow linguistic features.

The interaction of lexical and syntactic automata provides

robustness and flexibility when writing extraction rules. This

paper describes Odin’s declarative language for writing these

cascaded automata.

A Classification-based Approach to Economic Event Detection in Dutch News Text

Els Lefever and Véronique Hoste

Breaking news on economic events such as stock splits or

mergers and acquisitions has been shown to have a substantial

impact on the financial markets. As it is important to be

able to automatically identify events in news items accurately

and in a timely manner, we present in this paper proof-of-

concept experiments for a supervised machine learning approach

to economic event detection in newswire text. For this purpose, we

created a corpus of Dutch financial news articles in which 10 types

of company-specific economic events were annotated. We trained

classifiers using various lexical, syntactic and semantic features.

We obtain good results based on a basic set of shallow features,

thus showing that this method is a viable approach for economic

event detection in news text.
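
A minimal sketch of such a supervised set-up (shallow lexical features feeding a linear classifier) is shown below; the feature extractor, classifier and label set are illustrative assumptions rather than the authors' exact configuration.

    # Sketch: sentence-level economic event classification with shallow features.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def train_event_detector(sentences, event_labels):
        # sentences: Dutch news sentences; event_labels: annotated event type or "none".
        model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                              LogisticRegression(max_iter=1000))
        return model.fit(sentences, event_labels)

    # detector = train_event_detector(train_sentences, train_labels)
    # detector.predict(["Bedrijf X kondigt een aandelensplitsing aan."])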

Predictive Modeling: Guessing the NLP Terms of Tomorrow

Gil Francopoulo, Joseph Mariani and Patrick Paroubek

Predictive modeling, often called “predictive analytics” in

a commercial context, encompasses a variety of statistical

techniques that analyze historical and present facts to make

predictions about unknown events. Often the unknown events

are in the future, but prediction can be applied to any type of

unknown whether it be in the past or future. In our case, we

present some experiments applying predictive modeling to the

usage of technical terms within the NLP domain.

The Gavagai Living Lexicon

Magnus Sahlgren, Amaru Cuba Gyllensten, Fredrik Espinoza, Ola Hamfors, Jussi Karlgren, Fredrik Olsson, Per Persson, Akshay Viswanathan and Anders Holst

This paper presents the Gavagai Living Lexicon, which is an

online distributional semantic model currently available in 20

different languages. We describe the underlying distributional

semantic model, and how we have solved some of the challenges

in applying such a model to large amounts of streaming data. We

also describe the architecture of our implementation, and discuss

how we deal with continuous quality assurance of the lexicon.

Arabic to English Person Name Transliteration using Twitter

Hamdy Mubarak and Ahmed Abdelali

Social media outlets are providing new opportunities for

harvesting valuable resources. We present a novel approach

for mining data from Twitter for the purpose of building

transliteration resources and systems. Such resources are crucial

in translation and retrieval tasks. We demonstrate the benefits

of the approach on Arabic to English transliteration. The

contributions of this approach include the size of data that can be collected and exploited within a limited time span; the genericity of the approach, which can be adapted to other languages; and its ability to cope with new transliteration phenomena and trends. A statistical transliteration

system built using this data improved a comparable system built

from Wikipedia wikilinks data.

Korean TimeML and Korean TimeBank

Young-Seob Jeong, Won-Tae Joo, Hyun-Woo Do, Chae-Gyun Lim, Key-Sun Choi and Ho-Jin Choi

Many emerging documents usually contain temporal information.

Because the temporal information is useful for various

applications, it has become important to develop systems for extracting temporal information from documents. Before developing such a system, it is first necessary to define or design the

structure of temporal information. In other words, it is necessary

to design a language which defines how to annotate the temporal

information. There have been some studies about the annotation

languages, but most of them was applicable to only a specific

target language (e.g., English). Thus, it is necessary to design

an individual annotation language for each language. In this

paper, we propose a revised version of the Korean Time Mark-up Language (K-TimeML), and also introduce a dataset, named Korean TimeBank, that is constructed based on the K-TimeML.

We believe that the new K-TimeML and Korean TimeBank will

be used in much further research on the extraction of temporal

information.

A Large DataBase of Hypernymy Relations Extracted from the Web

Julian Seitner, Christian Bizer, Kai Eckert, Stefano Faralli, Robert Meusel, Heiko Paulheim and Simone Paolo Ponzetto

Hypernymy relations (those where a hyponym term shares an "is-a" relationship with its hypernym) play a key role for many

Natural Language Processing (NLP) tasks, e.g. ontology learning,

automatically building or extending knowledge bases, or word

sense disambiguation and induction. In fact, such relations may

provide the basis for the construction of more complex structures

such as taxonomies, or be used as effective background knowledge

for many word understanding applications. We present a publicly

available database containing more than 400 million hypernymy

relations we extracted from the CommonCrawl web corpus. We

describe the infrastructure we developed to iterate over the web

corpus for extracting the hypernymy relations and store them

effectively into a large database. This collection of relations

represents a rich source of knowledge and may be useful for

many researchers. We offer the tuple dataset for public download

and an Application Programming Interface (API) to help other

researchers programmatically query the database.
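
For readers unfamiliar with this line of work, the classic way to harvest such "is-a" pairs from raw text is Hearst-style lexico-syntactic patterns; the toy sketch below shows the idea (the actual extraction pipeline behind the database is described in the paper and is considerably more sophisticated).

    # Toy sketch of Hearst-style hypernymy extraction with regular expressions.
    import re

    HEARST = [
        re.compile(r"([A-Za-z][A-Za-z ]*?)\s+such as\s+([A-Za-z][A-Za-z ,]+)"),
        re.compile(r"([A-Za-z][A-Za-z ]*?)\s+including\s+([A-Za-z][A-Za-z ,]+)"),
    ]

    def extract_hypernymy(sentence):
        pairs = []
        for pattern in HEARST:
            for hyper, hypo_list in pattern.findall(sentence):
                head = hyper.strip().split()[-1]          # crude head-noun choice
                for hypo in re.split(r", | and ", hypo_list):
                    if hypo.strip():
                        pairs.append((hypo.strip(), head))
        return pairs

    # extract_hypernymy("European languages such as Slovenian, Italian and French")
    # -> [('Slovenian', 'languages'), ('Italian', 'languages'), ('French', 'languages')]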

Using a Cross-Language Information Retrieval System based on OHSUMED to Evaluate the Moses and KantanMT Statistical Machine Translation Systems

Nikolaos Katris, Richard Sutcliffe and Theodore Kalamboukis

The objective of this paper was to evaluate the performance of

two statistical machine translation (SMT) systems within a cross-

language information retrieval (CLIR) architecture and examine

if there is a correlation between translation quality and CLIR

performance. The SMT systems were KantanMT, a cloud-based

machine translation (MT) platform, and Moses, an open-source

MT application. First we trained both systems using the same

language resources: the EMEA corpus for the translation model

and language model and the QTLP corpus for tuning. Then

we translated the 63 queries of the OHSUMED test collection

from Greek into English using both MT systems. Next, we ran

the queries on the document collection using Apache Solr to

get a list of the top ten matches. The results were compared

to the OHSUMED gold standard. KantanMT achieved higher

average precision and F-measure than Moses, while both systems

produced the same recall score. We also calculated the BLEU

score for each system using the ECDC corpus. Moses achieved

a higher BLEU score than KantanMT. Finally, we also tested the

IR performance of the original English queries. This work overall

showed that CLIR performance can be better even when the BLEU

score is worse.

Two Decades of Terminology: European Framework Programmes Titles

Gabriella Pardelli, Sara Goggi, Silvia Giannini and Stefania Biagioni

This work analyses a corpus made of the titles of research projects

belonging to the last four European Commission Framework

Programmes (FP4, FP5, FP6, FP7) during a time span of

nearly two decades (1994-2012). The starting point is the

idea of creating a corpus of titles which would constitute a

terminological niche, a sort of “cluster map” offering an overall

vision on the terms used and the links between them. Moreover,

by performing a terminological comparison over a period of

time it is possible to trace the presence of obsolete words

in outdated research areas as well as of neologisms in the

most recent fields. Within this scenario, the minimal purpose

is to build a corpus of titles of European projects belonging

to the several Framework Programmes in order to obtain a

terminological mapping of relevant words in the various research

areas: particularly significant would be those terms spread across

different domains or those extremely tied to a specific domain.

A term could actually be found in many fields and being able to

acknowledge and retrieve this cross-presence means being able

to link those different domains by means of a process of

terminological mapping.

Legal Text Interpretation: Identifying Hohfeldian Relations from Text

Wim Peters and Adam Wyner

The paper investigates the extent of the support semi-automatic

analysis can provide for the specific task of assigning Hohfeldian

relations of Duty, using the General Architecture for Text

Engineering tool for the automated extraction of Duty instances

and the bearers of associated roles. The outcome of the analysis

supports scholars in identifying Hohfeldian structures in legal

text when performing close reading of the texts. A cyclic

workflow involving automated annotation and expert feedback

will incrementally increase the quality and coverage of the

automatic extraction process, and increasingly reduce the amount

of manual work required of the scholar.

Analysis of English Spelling Errors in a Word-Typing Game

Ryuichi Tachibana and Mamoru Komachi

The emergence of the web has necessitated the need to detect

and correct noisy consumer-generated texts. Most of the previous

studies on English spelling-error extraction collected English

spelling errors from web services such as Twitter by using the edit

distance or from input logs utilizing crowdsourcing. However, in

the former approach, it is not clear which word corresponds to the

spelling error, and the latter approach requires an annotation cost

for the crowdsourcing. One notable exception is Rodrigues and

Rytting (2012), who proposed to extract English spelling errors

by using a word-typing game. Their approach saves the cost

of crowdsourcing, and guarantees an exact alignment between

the word and the spelling error. However, they did not verify

whether the extracted spelling error corpora reflect the usual

writing process such as writing a document. Therefore, we

propose a new correctable word-typing game that is more similar

to the actual writing process. Experimental results showed that we

can regard typing-game logs as a source of spelling errors.

Finding Definitions in Large Corpora with Sketch Engine

Vojtech Kovár, Monika Mociariková and Pavel Rychlý

The paper describes automatic definition finding implemented

within the leading corpus query and management tool, Sketch

Engine. The implementation exploits complex pattern-matching

queries in the corpus query language (CQL) and the indexing

mechanism of word sketches for finding and storing definition

candidates throughout the corpus. The approach is evaluated for

Czech and English corpora, showing that the results are usable

in practice: precision of the tool ranges between 30 and 75

percent (depending on the major corpus text types) and we were

able to extract nearly 2 million definition candidates from an

English corpus with 1.4 billion words. The feature is embedded

into the interface as a concordance filter, so that users can

search for definitions of any query to the corpus, including very

specific multi-word queries. The results also indicate that ordinary

texts (unlike explanatory texts) contain a rather low number of

definitions, which is perhaps the most important problem with

automatic definition finding in general.
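
As a rough illustration of the kind of copular pattern such a tool targets ("X is a Y ..."), the sketch below uses plain regular expressions over untagged sentences; the actual implementation relies on CQL queries and word-sketch indexes inside Sketch Engine, which this toy example does not reproduce.

    # Toy sketch: match simple "X is a/an/the Y" definition candidates.
    import re

    DEF_PATTERN = re.compile(
        r"^(?P<definiendum>[A-Z][\w-]*(?: [\w-]+){0,3}) (?:is|are) (?:an?|the) (?P<definiens>.+)$")

    def find_definitions(sentences):
        hits = []
        for s in sentences:
            m = DEF_PATTERN.match(s.strip().rstrip("."))
            if m:
                hits.append((m.group("definiendum"), m.group("definiens")))
        return hits

    # find_definitions(["A corpus is a collection of texts used for linguistic analysis."])
    # -> [('A corpus', 'collection of texts used for linguistic analysis')]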

Improving Information Extraction from Wikipedia Texts using Basic English

Teresa Rodriguez-Ferreira, Adrian Rabadan, Raquel Hervas and Alberto Diaz

The aim of this paper is to study the effect that the use of Basic

English versus common English has on information extraction

from online resources. The amount of online information

available to the public grows exponentially, and is potentially

an excellent resource for information extraction. The problem

is that this information often comes in an unstructured format,

such as plain text. In order to retrieve knowledge from this type

of text, it must first be analysed to find the relevant details, and

the nature of the language used can greatly impact the quality

of the extracted information. In this paper, we compare triplets

that represent definitions or properties of concepts obtained from

three online collaborative resources (English Wikipedia, Simple

English Wikipedia and Simple English Wiktionary) and study the

differences in the results when Basic English is used instead of

common English. The results show that resources written in Basic

English produce a smaller number of triplets, but of higher quality.

NLP and Public Engagement: The Case of the Italian School Reform

Tommaso Caselli, Giovanni Moretti, Rachele Sprugnoli, Sara Tonelli, Damien Lanfrey and Donatella Solda Kutzmann

In this paper we present PIERINO (PIattaforma per l’Estrazione

e il Recupero di INformazione Online), a system that was

implemented in collaboration with the Italian Ministry of

Education, University and Research to analyse the citizens’

comments given in the #labuonascuola survey. The platform

includes various levels of automatic analysis such as key-concept

extraction and word co-occurrences. Each analysis is displayed

through an intuitive view using different types of visualizations,

for example radar charts and sunburst diagrams. PIERINO was effectively

used to support shaping the last Italian school reform, proving the

potential of NLP in the context of policy making.

Evaluating Translation Quality and CLIR Performance of Query Sessions

Xabier Saralegi, Eneko Agirre and Iñaki Alegria

This paper presents the evaluation of the translation quality

and Cross-Lingual Information Retrieval (CLIR) performance

when using session information as the context of queries. The

hypothesis is that previous queries provide context that helps to

solve ambiguous translations in the current query. We tested

several strategies on the TREC 2010 Session track dataset,

which includes query reformulations grouped by generalization,

specification, and drifting types. We study the Basque to

English direction, evaluating both the translation quality and CLIR

performance, with positive results in both cases. The results show

that the quality of translation improved, reducing error rate by

12% (HTER) when using session information, which improved

CLIR results by 5% (nDCG). We also provide an analysis of the

improvements across the three kinds of sessions: generalization,

specification, and drifting. Translation quality improved in all

three types (generalization, specification, and drifting), and CLIR

improved for generalization and specification sessions, preserving

the performance in drifting sessions.

Construction and Analysis of a Large Vietnamese Text Corpus

Dieu-Thu Le and Uwe Quasthoff

This paper presents a new Vietnamese text corpus which contains

around 4.05 billion words. It is a collection of Wikipedia texts,

newspaper articles and random web texts. The paper describes the

process of collecting, cleaning and creating the corpus. Processing

Vietnamese texts poses several challenges; for example, unlike many Latin-script languages, Vietnamese does not use blanks to separate words, so common tokenization approaches that treat blanks as word boundaries do not work. A short

review about different approaches of Vietnamese tokenization

is presented together with how the corpus has been processed

and created. After that, some statistical analysis on this data is

reported, including the number of syllables, average word length,

sentence length and topic analysis. The corpus is integrated into a

framework which allows searching and browsing. Using this web

interface, users can find out how many times a particular word

appears in the corpus, sample sentences where this word occurs,

and its left and right neighbors.

Forecasting Emerging Trends from Scientific Literature

Kartik Asooja, Georgeta Bordea, Gabriela Vulcu and Paul Buitelaar

Text analysis methods for the automatic identification of

emerging technologies by analyzing scientific publications

are gaining attention because of their socio-economic impact.

The approaches so far have been mainly focused on retrospective

analysis by mapping scientific topic evolution over time. We

propose regression based approaches to predict future keyword

distribution. The prediction is based on historical data of the

keywords, which in our case, are LREC conference proceedings.

Considering the insufficient number of data points available from

LREC proceedings, we do not employ standard time series

forecasting methods. We form a dataset by extracting the

keywords from previous year proceedings and quantify their

yearly relevance using tf-idf scores. This dataset additionally

contains ranked lists of related keywords and experts for each

keyword.
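
A minimal sketch of the forecasting idea, assuming per-keyword yearly tf-idf scores have already been computed (the years and values below are invented): fit a simple linear trend for each keyword and extrapolate it one edition ahead. The authors' actual regression models may differ.

import numpy as np

years = np.array([2008, 2010, 2012, 2014])          # past conference editions
tfidf_by_keyword = {                                  # toy relevance scores per keyword
    "word embeddings": np.array([0.00, 0.01, 0.05, 0.12]),
    "treebank":        np.array([0.30, 0.28, 0.27, 0.26]),
}

def forecast(year_target: int):
    """Extrapolate each keyword's tf-idf trend to a future year with a least-squares line."""
    predictions = {}
    for kw, scores in tfidf_by_keyword.items():
        slope, intercept = np.polyfit(years, scores, deg=1)
        predictions[kw] = max(0.0, slope * year_target + intercept)
    return predictions

print(forecast(2016))   # rising trend for the first toy keyword, flat/declining for the second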

Extracting Structured Scholarly Information from the Machine Translation Literature

Eunsol Choi, Matic Horvat, Jonathan May, Kevin Knight and Daniel Marcu

Understanding the experimental results of a scientific paper

is crucial to understanding its contribution and to comparing

it with related work. We introduce a structured, queryable

representation for experimental results and a baseline system that

automatically populates this representation. The representation

can answer compositional questions such as: “Which are the

best published results reported on the NIST 09 Chinese to

English dataset?” and “What are the most important methods

for speeding up phrase-based decoding?” Answering such

questions usually involves lengthy literature surveys. Current

machine reading for academic papers does not usually consider

the actual experiments, but mostly focuses on understanding

abstracts. We describe annotation work to create an initial

⟨scientific paper; experimental results representation⟩ corpus. The

corpus is composed of 67 papers which were manually annotated

with a structured representation of experimental results by domain

experts. Additionally, we present a baseline algorithm that

characterizes the difficulty of the inference task.

Staggered NLP-assisted refinement for Clinical Annotations of Chronic Disease Events

Stephen Wu, Chung-Il Wi, Sunghwan Sohn, Hongfang Liu and Young Juhn

Domain-specific annotations for NLP are often centered on real-

world applications of text, and incorrect annotations may be

particularly unacceptable. In medical text, the process of manual

chart review (of a patient’s medical record) is error-prone due to

its complexity. We propose a staggered NLP-assisted approach to

the refinement of clinical annotations, an interactive process that

allows initial human judgments to be verified or falsified by means

of comparison with an improving NLP system. We show on our

internal Asthma Timelines dataset that this approach improves the

quality of the human-produced clinical annotations.

“Who was Pietro Badoglio?” Towards a QA system for Italian History

Stefano Menini, Rachele Sprugnoli and Antonio Uva

This paper presents QUANDHO (QUestion ANswering Data for

italian HistOry), an Italian question answering dataset created

to cover a specific domain, i.e. the history of Italy in the first

half of the XX century. The dataset includes questions manually

classified and annotated with Lexical Answer Types, and a set of

question-answer pairs. This resource, freely available for research

purposes, has been used to retrain a domain independent question

answering system so to improve its performances in the domain of

interest. Ongoing experiments on the development of a question

classifier and an automatic tagger of Lexical Answer Types are

also presented.

O5 - LR Infrastructures and Architectures

Wednesday, May 25, 14:45

Chairperson: Franciska de Jong Oral Session

A Document Repository for Social Media and Speech Conversations

Adam Funk, Robert Gaizauskas and Benoit Favre

We present a successfully implemented document repository

REST service for flexible SCRUD (search, create, read,

update, delete) storage of social media conversations, using a

GATE/TIPSTER-like document object model and providing a

query language for document features. This software is currently

being used in the SENSEI research project and will be published

as open-source software before the project ends. It is, to the best

of our knowledge, the first freely available, general purpose data

repository to support large-scale multimodal (i.e., speech or text)

conversation analytics.
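
As a purely hypothetical sketch of what SCRUD-style access to such a repository could look like over HTTP: the base URL, endpoint paths, query syntax and JSON fields below are invented for illustration and are not the SENSEI repository's actual API, which the abstract does not specify.

import requests

BASE = "http://localhost:8080/repository"   # assumed local deployment, not a real service

def create_document(text, features):
    """Store one conversation document with its feature map (hypothetical endpoint)."""
    return requests.post(f"{BASE}/documents",
                         json={"text": text, "features": features}).json()

def search_documents(query):
    """Search by document features; the query syntax here is illustrative only."""
    return requests.get(f"{BASE}/documents", params={"q": query}).json()

doc = create_document("great keynote at #LREC2016",
                      {"source": "twitter", "language": "en"})
hits = search_documents("source = 'twitter' AND language = 'en'")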

Towards a Linguistic Ontology with an Emphasis on Reasoning and Knowledge Reuse

Artemis Parvizi, Matt Kohl, Meritxell Gonzàlez and Roser Saurí

The Dictionaries division at Oxford University Press (OUP)

is aiming to model, integrate, and publish lexical content

for 100 languages focussing on digitally under-represented

languages. While there are multiple ontologies designed for

linguistic resources, none had adequate features for meeting our

requirements, chief of which was the capability to losslessly

capture diverse features of many different languages in a

dictionary format, while supplying a framework for inferring

relations like translation, derivation, etc., between the data.

Building on valuable features of existing models, and working

with OUP monolingual and bilingual dictionary datasets, we

have designed and implemented a new linguistic ontology. The

ontology has been reviewed by a number of computational

linguists, and we are working to move more dictionary data into

it. We have also developed APIs to surface the linked data to

dictionary websites.

Providing a Catalogue of Language Resources for Commercial Users

Bente Maegaard, Lina Henriksen, Andrew Joscelyne, Vesna Lusicky, Margaretha Mazura, Sussi Olsen, Claus Povlsen and Philippe Wacker

Language resources (LR) are indispensable for the development of

tools for machine translation (MT) or various kinds of computer-

assisted translation (CAT). In particular, language corpora, both parallel and monolingual, are considered most important, for instance for MT, not only SMT but also hybrid MT. The Language

Technology Observatory will provide easy access to information

about LRs deemed to be useful for MT and other translation tools

through its LR Catalogue. In order to determine what aspects of

an LR are useful for MT practitioners, a user study was made,

providing a guide to the most relevant metadata and the most

relevant quality criteria. We have seen that many resources exist

which are useful for MT and similar work, but the majority are

for (academic) research or educational use only, and as such not

available for commercial use. Our work has revealed a list of gaps:

coverage gap, awareness gap, quality gap, quantity gap. The paper

ends with recommendations for a forward-looking strategy.

The Language Application Grid and Galaxy

Nancy Ide, Keith Suderman, James Pustejovsky, Marc Verhagen and Christopher Cieri

The NSF-SI2-funded LAPPS Grid project is a collaborative

effort among Brandeis University, Vassar College, Carnegie-

Mellon University (CMU), and the Linguistic Data Consortium

(LDC), which has developed an open, web-based infrastructure

through which resources can be easily accessed and within

which tailored language services can be efficiently composed,

evaluated, disseminated and consumed by researchers, developers,

and students across a wide variety of disciplines. The LAPPS

Grid project recently adopted Galaxy (Giardine et al., 2005),

a robust, well-developed, and well-supported front end for

workflow configuration, management, and persistence. Galaxy

allows data inputs and processing steps to be selected from

graphical menus, and results are displayed in intuitive plots

and summaries that encourage interactive workflows and the

exploration of hypotheses. The Galaxy workflow engine provides

significant advantages for deploying pipelines of LAPPS Grid web

services, including not only means to create and deploy locally-

run and even customized versions of the LAPPS Grid as well as

running the LAPPS Grid in the cloud, but also access to a huge

array of statistical and visualization tools that have been developed

for use in genomics research.

ELRA Activities and Services

Khalid Choukri, Valérie Mapelli, Hélène Mazo and Vladimir Popescu

After celebrating its 20th anniversary in 2015, ELRA is carrying

on its strong involvement in the HLT field. To share ELRA’s

expertise of the past 21 years, this article begins with a

presentation of ELRA’s strategic Data and LR Management Plan

for a wide use by the language communities. Then, we further

report on ELRA’s activities and services provided since LREC

2014. When looking at the cataloguing and licensing activities,

we can see that ELRA has been active at making the Meta-Share

repository move toward new developments steps, supporting

Europe to obtain accurate LRs within the Connecting Europe

Facility programme, promoting the use of LR citation, creating the

ELRA License Wizard web portal.The article further elaborates on

the recent LR production activities of various written, speech and

video resources, commissioned by public and private customers.

In parallel, ELDA has also worked on several EU-funded projects

centred on strategic issues related to the European Digital Single

Market. The last part gives an overview of the latest dissemination

activities, with a special focus on the celebration of its 20th

anniversary organised in Dubrovnik (Croatia) and the follow-up to LREC, as well as the launch of the new ELRA portal.

O6 - Multimodality

Wednesday, May 25, 14:45

Chairperson: Kristiina Jokinen Oral Session

Mirroring Facial Expressions and Emotions in Dyadic Conversations

Costanza Navarretta

This paper presents an investigation of mirroring facial

expressions and the emotions which they convey in dyadic

naturally occurring first encounters. Mirroring facial expressions

are a common phenomenon in face-to-face interactions, and they

are due to the mirror neuron system which has been found in

both animals and humans. Researchers have proposed that the

mirror neuron system is an important component behind many

cognitive processes such as action learning and understanding the

emotions of others. Preceding studies of the first encounters have

shown that overlapping speech and overlapping facial expressions

are very frequent. In this study, we want to determine whether

the overlapping facial expressions are mirrored or are otherwise

correlated in the encounters, and to what extent mirroring facial

expressions convey the same emotion. The results of our study

show that the majority of smiles and laughs, and one fifth of the

occurrences of raised eyebrows are mirrored in the data. Moreover

some facial traits in co-occurring expressions co-occur more often

than would be expected by chance. Finally, amusement, and

to a lesser extent friendliness, are often emotions shared by both

participants, while other emotions indicating individual affective

states such as uncertainty and hesitancy are never shown by both

participants, but co-occur with complementary emotions such as

friendliness and support. Whether these tendencies are specific

to this type of conversation or are more common should be

investigated further.

Humor in Collective Discourse: Unsupervised Funniness Detection in the New Yorker Cartoon Caption Contest

Dragomir Radev, Amanda Stent, Joel Tetreault, Aasish Pappu, Aikaterini Iliakopoulou, Agustin Chanfreau, Paloma de Juan, Jordi Vallmitjana, Alejandro Jaimes, Rahul Jha and Robert Mankoff

The New Yorker publishes a weekly captionless cartoon. More

than 5,000 readers submit captions for it. The editors select three

of them and ask the readers to pick the funniest one. We describe

an experiment that compares a dozen automatic methods for

selecting the funniest caption. We show that negative sentiment,

human-centeredness, and lexical centrality most strongly match

the funniest captions, followed by positive sentiment. These

results are useful for understanding humor and also in the design

of more engaging conversational agents in text and multimodal

(vision+text) systems. As part of this work, a large set of cartoons

and captions is being made available to the community.

A Corpus of Text Data and Gaze Fixations from Autistic and Non-Autistic Adults

Victoria Yaneva, Irina Temnikova and Ruslan Mitkov

The paper presents a corpus of text data and its corresponding

gaze fixations obtained from autistic and non-autistic readers.

The data was elicited through reading comprehension testing

combined with eye-tracking recording. The corpus consists of

1034 content words tagged with their POS, syntactic role and three

gaze-based measures corresponding to the autistic and control

participants. The reading skills of the participants were measured

through multiple-choice questions and, based on the answers

given, they were divided into groups of skillful and less-skillful

readers. This division of the groups informs researchers on

whether particular fixations were elicited from skillful or less-

skillful readers and allows a fair between-group comparison for

two levels of reading ability. In addition to describing the process

of data collection and corpus development, we present a study on

the effect that word length has on reading in autism. The corpus is

intended as a resource for investigating the particular linguistic

constructions which pose reading difficulties for people with

autism and hopefully, as a way to inform future text simplification

research intended for this population.

A Multimodal Corpus for the Assessment of Public Speaking Ability and Anxiety

Mathieu Chollet, Torsten Wörtwein, Louis-Philippe Morency and Stefan Scherer

The ability to efficiently speak in public is an essential asset for

many professions and is used in everyday life. As such, tools

enabling the improvement of public speaking performance and the

assessment and mitigation of anxiety related to public speaking

would be very useful. Multimodal interaction technologies,

such as computer vision and embodied conversational agents,

have recently been investigated for the training and assessment

of interpersonal skills. One central requirement for these

technologies is multimodal corpora for training machine learning

models. This paper addresses this need

by presenting and sharing a multimodal corpus of public

speaking presentations. These presentations were collected in

an experimental study investigating the potential of interactive

virtual audiences for public speaking training. This corpus

includes audio-visual data and automatically extracted features,

measures of public speaking anxiety and personality, annotations

of participants’ behaviors and expert ratings of behavioral aspects

and overall performance of the presenters. We hope this corpus

will help other research teams in developing tools for supporting

public speaking training.

Deep Learning of Audio and Language Features for Humor Prediction

Dario Bertero and Pascale Fung

We propose a comparison between various supervised machine

learning methods to predict and detect humor in dialogues.

We retrieve our humorous dialogues from a very popular TV

sitcom: “The Big Bang Theory”. We build a corpus where

punchlines are annotated using the canned laughter embedded in

the audio track. Our comparative study involves a linear-chain

Conditional Random Field, a Recurrent Neural Network and

a Convolutional Neural Network. Using a combination of word-

level and audio frame-level features, the CNN outperforms the

other methods, obtaining the best F-score of 68.5% over 66.5%

by CRF and 52.9% by RNN. Our work is a starting point to

developing more effective machine learning and neural network

models on the humor prediction task, as well as developing

machines capable of understanding humor in general.

O7 - Multiword Expressions

Wednesday, May 25, 14:45

Chairperson: Aline Villavicencio Oral Session

An Empirical Study of Arabic Formulaic Sequence Extraction Methods

Ayman Alghamdi, Eric Atwell and Claire Brierley

This paper aims to implement what is referred to as the

collocation of the Arabic keywords approach for extracting

formulaic sequences (FSs) in the form of high frequency but

semantically regular formulas that are not restricted to any

syntactic construction or semantic domain. The study applies

several distributional semantic models in order to automatically

extract relevant FSs related to Arabic keywords. The data sets

used in this experiment are rendered from a newly developed

corpus-based Arabic wordlist consisting of 5,189 lexical items

which represent a variety of modern standard Arabic (MSA)

genres and regions, the new wordlist being based on an

overlapping frequency based on a comprehensive comparison of

four large Arabic corpora with a total size of over 8 billion running

words. Empirical n-best precision evaluation methods are used

to determine the best association measures (AMs) for extracting

high frequency and meaningful FSs. The gold standard reference

FSs list was developed in previous studies and manually evaluated

against well-established quantitative and qualitative criteria. The

results demonstrate that the MI.log_f AM achieved the highest

results in extracting significant FSs from the large MSA corpus,

while the T-score association measure achieved the worst results.

Rule-based Automatic Multi-word Term Extraction and Lemmatization

Ranka Stankovic, Cvetana Krstev, Ivan Obradovic, Biljana Lazic and Aleksandra Trtovac

In this paper we present a rule-based method for multi-word

term extraction that relies on extensive lexical resources in the

form of electronic dictionaries and finite-state transducers for

modelling various syntactic structures of multi-word terms. The

same technology is used for lemmatization of extracted multi-

word terms, which is unavoidable for highly inflected languages

in order to pass extracted data to evaluators and subsequently

to terminological e-dictionaries and databases. The approach is

illustrated on a corpus of Serbian texts from the mining domain

containing more than 600,000 simple word forms. Extracted and

lemmatized multi-word terms are filtered in order to reject falsely

offered lemmas and then ranked by introducing measures that

combine linguistic and statistical information (C-Value, T-Score,

LLR, and Keyness). Mean average precision for retrieval of MWU

forms ranges from 0.789 to 0.804, while mean average precision

of lemma production ranges from 0.956 to 0.960. The evaluation

showed that 94% of distinct multi-word forms were evaluated as

proper multi-word units, and among them 97% were associated

with correct lemmas.
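
For readers unfamiliar with the C-Value measure named above, the following is a small worked sketch of its standard formulation with toy frequencies; it is not the authors' implementation, and the example terms are invented.

import math

def c_value(term, freq, longer_terms):
    """term: tuple of tokens; freq: dict term -> corpus frequency;
    longer_terms: candidate terms that contain `term` nested inside them."""
    length_weight = math.log2(len(term))
    if not longer_terms:
        return length_weight * freq[term]
    nested_penalty = sum(freq[t] for t in longer_terms) / len(longer_terms)
    return length_weight * (freq[term] - nested_penalty)

freq = {
    ("open", "pit"): 40,
    ("open", "pit", "mine"): 12,
    ("open", "pit", "mining", "method"): 3,
}
longer = [t for t in freq if len(t) > 2 and t[:2] == ("open", "pit")]
print(c_value(("open", "pit"), freq, longer))   # log2(2) * (40 - (12 + 3) / 2) = 32.5

In the ranking described above, scores such as this are combined with T-Score, LLR and Keyness rather than used in isolation.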

Distribution of Valency Complements in Czech Complex Predicates: Between Verb and Noun

Václava Kettnerová and Eduard Bejcek

In this paper, we focus on Czech complex predicates formed

by a light verb and a predicative noun expressed as the

direct object. Although Czech – as an inflectional language

encoding syntactic relations via morphological cases – provides

an excellent opportunity to study the distribution of valency

complements in the syntactic structure with complex predicates,

this distribution has not been described so far. On the basis of

a manual analysis of the richly annotated data from the Prague

Dependency Treebank, we thus formulate principles governing

this distribution. In an automatic experiment, we verify these

principles on well-formed syntactic structures from the Prague

18

Dependency Treebank and the Prague Czech-English Dependency

Treebank with very satisfactory results: the distribution of 97%

of valency complements in the surface structure is governed by

the proposed principles. These results corroborate that the surface

structure formation of complex predicates is a regular process.

A Lexical Resource of Hebrew Verb-Noun Multi-Word Expressions

Chaya Liebeskind and Yaakov HaCohen-Kerner

A verb-noun Multi-Word Expression (MWE) is a combination

of a verb and a noun with or without other words, in which

the combination has a meaning different from the meaning of

the words considered separately. In this paper, we present

a new lexical resource of Hebrew Verb-Noun MWEs (VN-

MWEs). The VN-MWEs of this resource were manually collected

and annotated from five different web resources. In addition,

we analyze the lexical properties of Hebrew VN-MWEs by

classifying them into three types: morphological, syntactic, and

semantic. These two contributions are essential for designing

algorithms for automatic VN-MWEs extraction. The analysis

suggests some interesting features of VN-MWEs for exploration.

The lexical resource enables sampling a set of positive examples

for Hebrew VN-MWEs. This set of examples can either be used

for training supervised algorithms or as seeds in unsupervised

bootstrapping algorithms. Thus, this resource is a first step

towards automatic identification of Hebrew VN-MWEs, which

is important for natural language understanding, generation and

translation systems.

Cross-lingual Linking of Multi-word Entities and their corresponding Acronyms

Guillaume Jacquet, Maud Ehrmann, Ralf Steinberger and Jaakko Väyrynen

This paper reports on an approach and experiments to

automatically build a cross-lingual multi-word entity resource.

Starting from a collection of millions of acronym/expansion pairs

for 22 languages where expansion variants were grouped into

monolingual clusters, we experiment with several aggregation

strategies to link these clusters across languages. Aggregation

strategies make use of string similarity distances and translation

probabilities and they are based on vector space and graph

representations. The accuracy of the approach is evaluated against

Wikipedia’s redirection and cross-lingual linking tables. The

resulting multi-word entity resource contains 64,000 multi-word

entities with unique identifiers and their 600,000 multilingual

lexical variants. We intend to make this new resource publicly

available.
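
One of the aggregation signals mentioned above, string similarity between expansion variants, can be illustrated with the toy sketch below (the clusters and the threshold are invented); the actual system additionally uses translation probabilities and vector-space and graph representations.

from difflib import SequenceMatcher

# Monolingual clusters of expansion variants, keyed by (acronym, language).
clusters = {
    ("WHO", "en"): ["World Health Organization", "World Health Organisation"],
    ("OMS", "fr"): ["Organisation mondiale de la santé"],
    ("WHO", "de"): ["Weltgesundheitsorganisation"],
}

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between two expansion strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link_clusters(threshold: float = 0.5):
    """Propose cross-lingual links between clusters whose best variant pair is similar enough."""
    keys = list(clusters)
    links = []
    for i, k1 in enumerate(keys):
        for k2 in keys[i + 1:]:
            best = max(similarity(a, b) for a in clusters[k1] for b in clusters[k2])
            if best >= threshold:
                links.append((k1, k2, round(best, 2)))
    return links

print(link_clusters())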

O8 - Named Entity Recognition

Wednesday, May 25, 14:45

Chairperson: Yuji Matsumoto Oral Session

SemLinker, a Modular and Open Source Framework for Named Entity Discovery and Linking

Marie-Jean Meurs, Hayda Almeida, Ludovic Jean-Louis and Eric Charton

This paper presents SemLinker, an open source system that

discovers named entities, connects them to a reference knowledge

base, and clusters them semantically. SemLinker relies on

several modules that perform surface form generation, mutual

disambiguation, entity clustering, and make use of two annotation

engines. SemLinker was evaluated in the English Entity

Discovery and Linking track of the Text Analysis Conference

on Knowledge Base Population, organized by the US National

Institute of Standards and Technology. Along with the SemLinker

source code, we release our annotation files containing the

discovered named entities, their types, and position across

processed documents.

Context-enhanced Adaptive Entity Linking

Filip Ilievski, Giuseppe Rizzo, Marieke van Erp, Julien Plu and Raphael Troncy

More and more knowledge bases are publicly available as linked

data. Since these knowledge bases contain structured descriptions

of real-world entities, they can be exploited by entity linking

systems that anchor entity mentions from text to the most

relevant resources describing those entities. In this paper, we

investigate adaptation of the entity linking task using contextual

knowledge. The key intuition is that entity linking can be

customized depending on the textual content, as well as on the

application that would make use of the extracted information. We

present an adaptive approach that relies on contextual knowledge

from text to enhance the performance of ADEL, a hybrid linguistic

and graph-based entity linking system. We evaluate our approach

on a domain-specific corpus consisting of annotated WikiNews

articles.

Named Entity Recognition on Twitter for Turkish using Semi-supervised Learning with Word Embeddings

Eda Okur, Hakan Demir and Arzucan Özgür

Recently, due to the increasing popularity of social media, the

necessity for extracting information from informal text types,

such as microblog texts, has gained significant attention. In

this study, we focused on the Named Entity Recognition (NER)

problem on informal text types for Turkish. We utilized a

semi-supervised learning approach based on neural networks.

We applied a fast unsupervised method for learning continuous

representations of words in vector space. We made use of these

obtained word embeddings, together with language independent

features that are engineered to work better on informal text types,

for generating a Turkish NER system on microblog texts. We

evaluated our Turkish NER system on Twitter messages and

achieved better F-score performances than the published results

of previously proposed NER systems on Turkish tweets. Since

we did not employ any language dependent features, we believe

that our method can be easily adapted to microblog texts in other

morphologically rich languages.

Entity Linking with a Paraphrase Flavor

Maria Pershina, Yifan He and Ralph Grishman

The task of Named Entity Linking is to link entity mentions

in the document to their correct entries in a knowledge base

and to cluster NIL mentions. Ambiguous, misspelled, and

incomplete entity mention names are the main challenges in the

linking process. We propose a novel approach that combines

two state-of-the-art models — for entity disambiguation and

for paraphrase detection — to overcome these challenges. We

consider name variations as paraphrases of the same entity

mention and adopt a paraphrase model for this task. Our

approach utilizes a graph-based disambiguation model based on

Personalized Page Rank, and then refines and clusters its output

using the paraphrase similarity between entity mention strings. It

achieves a competitive performance of 80.5% in B³+F clustering

score on diagnostic TAC EDL 2014 data.
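
The graph-based disambiguation half of the approach can be illustrated with a toy Personalized PageRank computation using networkx; the mini knowledge graph and the candidate entities below are invented, and the paraphrase-based refinement step is not shown.

import networkx as nx

# Nodes are candidate KB entries; weighted edges encode semantic relatedness.
G = nx.Graph()
G.add_weighted_edges_from([
    ("Michael_Jordan_(basketball)", "Chicago_Bulls", 0.9),
    ("Michael_Jordan_(scientist)",  "UC_Berkeley",   0.9),
    ("Chicago_Bulls",               "NBA",           0.8),
])

# Teleport mass goes to entities already mentioned unambiguously in the document.
personalization = {n: 0.0 for n in G}
personalization["Chicago_Bulls"] = 1.0

scores = nx.pagerank(G, alpha=0.85, personalization=personalization, weight="weight")
candidates = ["Michael_Jordan_(basketball)", "Michael_Jordan_(scientist)"]
print(max(candidates, key=scores.get))   # the basketball reading wins in this toy graph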

Domain Adaptation for Named Entity Recognition Using CRFs

Tian Tian, Marco Dinarelli, Isabelle Tellier and Pedro Dias Cardoso

In this paper we explain how we created a labelled corpus in

English for a Named Entity Recognition (NER) task from multi-

source and multi-domain data, for an industrial partner. We

explain the specificities of this corpus with examples and describe

some baseline experiments. We present some results of domain

adaptation on this corpus using a labelled Twitter corpus (Ritter

et al., 2011). We tested a semi-supervised method from (Garcia-

Fernandez et al., 2014) combined with a supervised domain

adaptation approach proposed in (Raymond and Fayolle, 2010) for

machine learning experiments with CRFs (Conditional Random

Fields). We use the same technique to improve the NER results

on the Twitter corpus (Ritter et al., 2011). Our contributions

thus consist in an industrial corpus creation and NER performance

improvements.

P05 - Machine Translation (1)

Wednesday, May 25, 14:45

Chairperson: Martin Volk Poster Session

IRIS: English-Irish Machine Translation System

Mihael Arcan, Caoilfhionn Lane, Eoin Ó Droighneáin and Paul Buitelaar

We describe IRIS, a statistical machine translation (SMT) system

for translating from English into Irish and vice versa. Since Irish

is considered an under-resourced language with a limited amount

of machine-readable text, building a machine translation system

that produces reasonable translations is rather challenging. As

translation is a difficult task, current research in SMT focuses

on obtaining statistics either from a large amount of parallel,

monolingual or other multilingual resources. Nevertheless, we

collected available English-Irish data and developed an SMT

system aimed at supporting human translators and enabling cross-

lingual language technology tasks.

Linguistically Inspired Language Model Augmentation for MT

George Tambouratzis and Vasiliki Pouli

The present article reports on efforts to improve the translation

accuracy of a corpus-based Machine Translation (MT) system.

In order to achieve that, an error analysis performed on past

translation outputs has indicated the likelihood of improving the

translation accuracy by augmenting the coverage of the Target-

Language (TL) side language model. The method adopted for

improving the language model is initially presented, based on the

concatenation of consecutive phrases. The algorithmic steps are

then described that form the process for augmenting the language

model. The key idea is to only augment the language model to

cover the most frequent cases of phrase sequences, as counted

over a TL-side corpus, in order to maximize the cases covered

by the new language model entries. Experiments presented in the

article show that substantial improvements in translation accuracy

are achieved via the proposed method, when integrating the grown

language model to the corpus-based MT system.

A Rule-based Shallow-transfer Machine Translation System for Scots and English

Gavin Abercrombie

An open-source rule-based machine translation system is

developed for Scots, a low-resourced minor language closely

related to English and spoken in Scotland and Ireland. By

concentrating on translation for assimilation (gist comprehension)

from Scots to English, it is proposed that the development of

dictionaries designed to be used within the Apertium platform

will be sufficient to produce translations that improve non-Scots

speakers' understanding of the language. Mono- and bilingual

Scots dictionaries are constructed using lexical items gathered

from a variety of resources across several domains. Although

the primary goal of this project is translation for gisting, the

system is evaluated for both assimilation and dissemination

(publication-ready translations). A variety of evaluation methods

are used, including a cloze test undertaken by human volunteers.

While evaluation results are comparable to, and in some cases

superior to, those of other language pairs within the Apertium

platform, room for improvement is identified in several areas of

the system.

Syntax-based Multi-system Machine Translation

Matiss Rikters and Inguna Skadina

This paper describes a hybrid machine translation system that

uses a parser to acquire syntactic chunks of a source sentence,

translates the chunks with multiple online machine translation

(MT) system application program interfaces (APIs) and creates

output by combining translated chunks to obtain the best possible

translation. The selection of the best translation hypothesis

is performed by calculating the perplexity for each translated

chunk. The goal of this approach is to enhance the baseline

multi-system hybrid translation (MHyT) system that uses only

a language model to select the best translation from the translations

obtained with different APIs and to improve overall English –

Latvian machine translation quality over each of the individual

MT APIs. The presented syntax-based multi-system translation

(SyMHyT) system demonstrates an improvement in terms of

BLEU and NIST scores compared to the baseline system.

Improvements reach from 1.74 up to 2.54 BLEU points.
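
A simplified sketch of the selection step described above: score each candidate chunk translation with a target-language model and keep the lowest-perplexity one. The add-one-smoothed bigram model and the candidate strings below are invented stand-ins for the system's actual components.

import math
from collections import Counter

train = "the cat sat on the mat . the dog sat on the rug .".split()
unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))
V = len(unigrams)

def perplexity(sentence: str) -> float:
    """Per-bigram perplexity under an add-one-smoothed bigram model."""
    tokens = sentence.lower().split()
    log_prob = 0.0
    for w1, w2 in zip(tokens, tokens[1:]):
        p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(1, len(tokens) - 1))

candidates = {                      # one chunk, as translated by different MT APIs
    "api_a": "the cat sat on the mat",
    "api_b": "cat the on sat mat the",
}
best = min(candidates, key=lambda k: perplexity(candidates[k]))
print(best)                         # the fluent candidate gets the lower perplexity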

Use of Domain-Specific Language Resources in Machine Translation

Sanja Štajner, Andreia Querido, Nuno Rendeiro, João António Rodrigues and António Branco

In this paper, we address the problem of Machine Translation

(MT) for a specialised domain in a language pair for which only

a very small domain-specific parallel corpus is available. We

conduct a series of experiments using a purely phrase-based SMT

(PBSMT) system and a hybrid MT system (TectoMT), testing

three different strategies to overcome the problem of the small

amount of in-domain training data. Our results show that adding

a small in-domain bilingual terminology to the small in-

domain training corpus leads to the best improvements of a hybrid

MT system, while the PBSMT system achieves the best results

by adding a combination of in-domain bilingual terminology

and a larger out-of-domain corpus. We focus on qualitative

human evaluation of the output of the two best systems (one for

each approach) and perform a systematic in-depth error analysis

which revealed advantages of the hybrid MT system over the pure

PBSMT system for this specific task.

CATaLog Online: Porting a Post-editing Tool to the Web

Santanu Pal, Marcos Zampieri, Sudip Kumar Naskar, Tapas Nayak, Mihaela Vela and Josef van Genabith

This paper presents CATaLog online, a new web-based MT and

TM post-editing tool. CATaLog online is freeware

that can be used through a web browser and it requires only

a simple registration. The tool features a number of editing

and log functions similar to the desktop version of CATaLog

enhanced with several new features that we describe in detail in

this paper. CATaLog online is designed to allow users to post-edit

both translation memory segments as well as machine translation

output. The tool provides a complete set of log information

currently not available in most commercial CAT tools. Log

information can be used both for project management purposes

as well as for the study of the translation process and translator’s

productivity.

The ILMT-s2s Corpus – A Multimodal Interlingual Map Task Corpus

Akira Hayakawa, Saturnino Luz, Loredana Cerrato and Nick Campbell

This paper presents the multimodal Interlingual Map Task Corpus

(ILMT-s2s corpus) collected at Trinity College Dublin, and

discusses some of the issues related to the collection and analysis

of the data. The corpus design is inspired by the HCRC Map Task

Corpus which was initially designed to support the investigation

of linguistic phenomena, and has been the focus of a variety of

studies of communicative behaviour. The simplicity of the task,

and the complexity of phenomena it can elicit, make the map

task an ideal object of study. Although there are studies that

used replications of the map task to investigate communication

in computer mediated tasks, this ILMT-s2s corpus is, to the

best of our knowledge, the first investigation of communicative

behaviour in the presence of three additional “filters”: Automatic

Speech Recognition (ASR), Machine Translation (MT) and Text

To Speech (TTS) synthesis, where the instruction giver and the

instruction follower speak different languages. This paper details

the data collection setup and completed annotation of the ILMT-

s2s corpus, and outlines preliminary results obtained from the

data.

Name Translation based on Fine-grained Named Entity Recognition in a Single Language

Kugatsu Sadamitsu, Itsumi Saito, Taichi Katayama, Hisako Asano and Yoshihiro Matsuo

We propose named entity abstraction methods with fine-grained

named entity labels for improving statistical machine translation

(SMT). The methods are based on a bilingual named entity

recognizer that uses a monolingual named entity recognizer

with transliteration. Through experiments, we demonstrate that

incorporating fine-grained named entities into statistical machine

translation improves the accuracy of SMT with more adequate

granularity compared with the standard SMT, which is a non-

named entity abstraction method.

Lexical Resources to Enrich English Malayalam Machine Translation

Sreelekha S and Pushpak Bhattacharyya

In this paper we present our work on the usage of lexical resources

for Machine Translation between English and Malayalam. We describe

a comparative performance study of different Statistical Machine Translation (SMT) systems, with a phrase-based SMT system as the baseline. We explore different ways of utilizing lexical

resources to improve the quality of English Malayalam statistical

machine translation. In order to enrich the training corpus we

have augmented the lexical resources in two ways (a) additional

vocabulary and (b) inflected verbal forms. Lexical resources

include IndoWordnet semantic relation set, lexical words and

verb phrases etc. We have described case studies, evaluations

and have given detailed error analysis for both Malayalam to

English and English to Malayalam machine translation systems.

We observed significant improvement in evaluations of translation

quality. Lexical resources do help improve performance when

parallel corpora are scanty.

Novel elicitation and annotation schemes for sentential and sub-sentential alignments of bitexts

Yong Xu and François Yvon

Resources for evaluating sentence-level and word-level alignment

algorithms are unsatisfactory. Regarding sentence alignments, the

existing data is too scarce, especially when it comes to difficult

bitexts, containing instances of non-literal translations. Regarding

word-level alignments, most available hand-aligned data provide a

complete annotation at the level of words that is difficult to exploit,

for lack of a clear semantics for alignment links. In this study,

we propose new methodologies for collecting human judgements

on alignment links, which have been used to annotate 4 new data

sets, at the sentence and at the word level. These will be released

online, with the hope that they will prove useful to evaluate

alignment software and quality estimation tools for automatic

alignment.

Keywords: Parallel corpora, Sentence Alignments, Word Alignments, Confidence Estimation

PROTEST: A Test Suite for Evaluating Pronouns in Machine Translation

Liane Guillou and Christian Hardmeier

We present PROTEST, a test suite for the evaluation of pronoun

translation by MT systems. The test suite comprises 250 hand-

selected pronoun tokens and an automatic evaluation method

which compares the translations of pronouns in MT output with

those in the reference translation. Pronoun translations that do not

match the reference are referred for manual evaluation. PROTEST

is designed to support analysis of system performance at the level

of individual pronoun groups, rather than to provide a single

aggregate measure over all pronouns. We wish to encourage

detailed analyses to highlight issues in the handling of specific

linguistic mechanisms by MT systems, thereby contributing to

a better understanding of the problems involved in translating

pronouns. We present two use cases for PROTEST: a) for

measuring improvement/degradation of an incremental system

change, and b) for comparing the performance of a group of

systems whose design may be largely unrelated. Following the

latter use case, we demonstrate the application of PROTEST to the

evaluation of the systems submitted to the DiscoMT 2015 shared

task on pronoun translation.

Paraphrasing Out-of-Vocabulary Words with Word Embeddings and Semantic Lexicons for Low Resource Statistical Machine Translation

Chenhui Chu and Sadao Kurohashi

Out-of-vocabulary (OOV) words are a crucial problem in

statistical machine translation (SMT) with low resources. OOV

paraphrasing that augments the translation model for the OOV

words by using the translation knowledge of their paraphrases has

been proposed to address the OOV problem. In this paper, we

propose using word embeddings and semantic lexicons for OOV

paraphrasing. Experiments conducted on a low resource setting of

the OLYMPICS task of IWSLT 2012 verify the effectiveness of

our proposed method.
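
A schematic sketch of the core idea, with made-up three-dimensional "embeddings", a toy synonym lexicon and an invented vocabulary: propose in-vocabulary paraphrases for an OOV source word by cosine similarity, optionally keeping only candidates that the semantic lexicon also lists as related.

import numpy as np

embeddings = {
    "marvelous": np.array([0.9, 0.1, 0.0]),
    "wonderful": np.array([0.85, 0.15, 0.05]),
    "terrible":  np.array([-0.8, 0.2, 0.1]),
}
translation_model_vocab = {"wonderful", "terrible"}     # words the SMT system can translate
semantic_lexicon = {"marvelous": {"wonderful", "great"}}  # e.g. a synonym dictionary

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def paraphrase_oov(oov: str, top_k: int = 1):
    """Rank in-vocabulary paraphrase candidates for an OOV word."""
    candidates = [
        (w, cosine(embeddings[oov], v))
        for w, v in embeddings.items()
        # if the lexicon has no entry for the OOV word, fall back to allowing the candidate
        if w in translation_model_vocab and w in semantic_lexicon.get(oov, {w})
    ]
    return sorted(candidates, key=lambda x: -x[1])[:top_k]

print(paraphrase_oov("marvelous"))   # -> [('wonderful', ...)]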

P06 - Parsing

Wednesday, May 25, 14:45

Chairperson: Giuseppe Attardi Poster Session

The Denoised Web Treebank: Evaluating Dependency Parsing under Noisy Input Conditions

Joachim Daiber and Rob van der Goot

We introduce the Denoised Web Treebank: a treebank including

a normalization layer and a corresponding evaluation metric

for dependency parsing of noisy text, such as Tweets. This

benchmark enables the evaluation of parser robustness as well as

text normalization methods, including normalization as machine

translation and unsupervised lexical normalization, directly on

syntactic trees. Experiments show that text normalization together

with a combination of domain-specific and generic part-of-speech

taggers can lead to a significant improvement in parsing accuracy

on this test set.

Punctuation Prediction for Unsegmented Transcript Based on Word Vector

Xiaoyin Che, Cheng Wang, Haojin Yang and Christoph Meinel

In this paper we propose an approach to predict punctuation marks

for unsegmented speech transcript. The approach is purely lexical,

with pre-trained Word Vectors as the only input. A training model

of Deep Neural Network (DNN) or Convolutional Neural Network

(CNN) is applied to classify whether a punctuation mark should be

inserted after the third word of a 5-word sequence and which kind

of punctuation mark the inserted one should be. TED talks within

IWSLT dataset are used in both training and evaluation phases.

The proposed approach shows its effectiveness by achieving better

result than the state-of-the-art lexical solution which works with

same type of data, especially when predicting punctuation position

only.
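
A reduced sketch of the classification setup described above, with randomly initialised vectors standing in for the pre-trained word vectors and a small feed-forward network (scikit-learn's MLPClassifier) in place of the DNN/CNN models; the windows and labels are invented.

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
dim = 10
vocab = ["you", "very", "much", "today", "we", "will", "talk", "about"]
vectors = {w: rng.normal(size=dim) for w in vocab}       # stand-in for pre-trained vectors

def window_features(window):
    """Concatenate the vectors of a 5-word window into one feature vector."""
    return np.concatenate([vectors[w] for w in window])

X = np.array([
    window_features(["you", "very", "much", "today", "we"]),     # "... very much. Today we ..."
    window_features(["today", "we", "will", "talk", "about"]),   # no mark after "will"
])
y = ["PERIOD", "NONE"]                                   # label = mark inserted after the 3rd word

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0).fit(X, y)
print(clf.predict([window_features(["you", "very", "much", "today", "we"])]))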

Evaluating a Deterministic Shift-Reduce Neural Parser for Constituent Parsing

Hao Zhou, Yue Zhang, Shujian Huang, Xin-Yu Dai and Jiajun Chen

Greedy transition-based parsers are appealing for their very fast

speed, with reasonably high accuracies. In this paper, we build

a fast shift-reduce neural constituent parser by using a neural

network to make local decisions. One challenge to the parsing

speed is the large hidden and output layer sizes caused by the

number of constituent labels and branching options. We speed

up the parser by using a hierarchical output layer, inspired by the

hierarchical log-bilinear neural language model. In standard WSJ

experiments, the neural parser achieves an almost 2.4-times speed-up (320 sen/sec) compared to a non-hierarchical baseline without

significant accuracy loss (89.06 vs 89.13 F-score).

Language Resource Addition Strategies for Raw Text Parsing

Atsushi Ushiku, Tetsuro Sasada and Shinsuke Mori

We focus on the improvement of accuracy of raw text parsing,

from the viewpoint of language resource addition. In Japanese, the

raw text parsing is divided into three steps: word segmentation,

part-of-speech tagging, and dependency parsing. We investigate

the contribution of language resource addition in each of three

steps to the improvement in accuracy for two domain corpora. The

experimental results show that this improvement depends on the

target domain. For example, when we handle well-written texts with a limited vocabulary, such as white papers, an effective language resource for parsing accuracy is a word-POS pair sequence corpus. So we

conclude that it is important to examine the characteristics of the target domain and to choose a suitable language resource addition strategy to improve parsing accuracy.

E-TIPSY: Search Query Corpus Annotated with Entities, Term Importance, POS Tags, and Syntactic Parses

Yuval Marton and Kristina Toutanova

We present E-TIPSY, a search query corpus annotated with named

Entities, Term Importance, POS tags, and SYntactic parses. This

corpus contains crowdsourced (gold) annotations of the three

most important terms in each query. In addition, it contains

automatically produced annotations of named entities, part-of-

speech tags, and syntactic parses for the same queries. This

corpus comes in two formats: (1) Sober Subset: annotations that

two or more crowd workers agreed upon, and (2) Full Glass: all

annotations. We analyze the strikingly low correlation between

term importance and syntactic headedness, which invites research

into effective ways of combining these different signals. Our

corpus can serve as a benchmark for term importance methods

aimed at improving search engine quality and as an initial step

toward developing a dataset of gold linguistic analysis of web

search queries. In addition, it can be used as a basis for linguistic

inquiries into the kind of expressions used in search.

AfriBooms: An Online Treebank for Afrikaans

Liesbeth Augustinus, Peter Dirix, Daniel Van Niekerk, Ineke Schuurman, Vincent Vandeghinste, Frank Van Eynde and Gerhard Van Huyssteen

Compared to well-resourced languages such as English and

Dutch, natural language processing (NLP) tools for Afrikaans are

still not abundant. In the context of the AfriBooms project, KU

Leuven and the North-West University collaborated to develop

a first, small treebank, a dependency parser, and an easy to

use online linguistic search engine for Afrikaans for use by

researchers and students in the humanities and social sciences.

The search tool is based on a similar development for Dutch,

i.e. GrETEL, a user-friendly search engine which allows users to

query a treebank by means of a natural language example instead

of a formal search instruction.

Differentia compositionem facit. A Slower-Paced and Reliable Parser for Latin

Edoardo Maria Ponti and Marco Passarotti

The Index Thomisticus Treebank is the largest available

treebank for Latin; it contains Medieval Latin texts by Thomas

Aquinas. After experimenting on its data with a number

of dependency parsers based on different supervised machine

learning techniques, we found that DeSR with a multilayer

perceptron algorithm, a right-to-left transition, and a tailor-

made feature model is the parser providing the highest accuracy

rates. We improved the results further by using a technique

that combines the output parses of DeSR with those provided

by other parsers, outperforming the previous state of the art in

parsing the Index Thomisticus Treebank. The key idea behind

such improvement is to ensure a sufficient diversity and accuracy

of the outputs to be combined; for this reason, we performed an

in-depth evaluation of the results provided by the different parsers

that we combined. Finally, we assessed that, although the general architecture of the parser is portable to Classical Latin, the model trained on Medieval Latin is inadequate for that purpose.

South African Language Resources: Phrase Chunking

Roald Eiselen

Phrase chunking remains an important natural language

processing (NLP) technique for intermediate syntactic processing.

This paper describes the development of protocols, annotated

phrase chunking data sets and automatic phrase chunkers for

ten South African languages. Various problems with adapting

the existing annotation protocols of English are discussed as

well as an overview of the annotated data sets. Based on the

annotated sets, CRF-based phrase chunkers are created and tested

with a combination of different features, including part of speech

tags and character n-grams. The results of the phrase chunking

evaluation show that disjunctively written languages can achieve

notably better results for phrase chunking with a limited data

set than conjunctive languages, but that the addition of character

n-grams improves the results for conjunctive languages.
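
A compact sketch of CRF-based chunking with the feature types mentioned above (POS tags plus character n-grams), using the sklearn-crfsuite package and two toy English sentences; the actual annotation protocols, data sets and feature set are the authors' own.

import sklearn_crfsuite

def token_features(sent, i):
    """Features for one token: its POS tag, lowercased form, character n-grams, previous POS."""
    word, pos = sent[i]
    feats = {"pos": pos, "lower": word.lower()}
    for n in (2, 3):                                   # character n-grams (prefixes/suffixes)
        feats[f"prefix{n}"] = word[:n]
        feats[f"suffix{n}"] = word[-n:]
    if i > 0:
        feats["prev_pos"] = sent[i - 1][1]
    return feats

train = [
    [("The", "DET"), ("old", "ADJ"), ("man", "NOUN"), ("walks", "VERB")],
    [("A", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
]
labels = [["B-NP", "I-NP", "I-NP", "B-VP"], ["B-NP", "I-NP", "B-VP"]]

X = [[token_features(s, i) for i in range(len(s))] for s in train]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X)[0])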

Neural Scoring Function for MST Parser

Jindrich Libovický

Continuous word representations appeared to be a useful feature

in many natural language processing tasks. Using fixed-dimension

pre-trained word embeddings allows us to avoid a sparse bag-of-words representation and to train models with fewer parameters.

In this paper, we use fixed pre-trained word embeddings as

additional features for a neural scoring function in the MST parser.

With the multi-layer architecture of the scoring function we can

avoid handcrafting feature conjunctions. The continuous word

representations on the input also allow us to reduce the number of

lexical features, make the parser more robust to out-of-vocabulary

words, and reduce the total number of parameters of the model.

Although its accuracy stays below the state of the art, the model

size is substantially smaller than with the standard features set.

Moreover, it performs well for languages where only a smaller

treebank is available and the results promise to be useful in cross-

lingual parsing.

Analysing Constraint Grammars with a SAT-solver

Inari Listenmaa and Koen Claessen

We describe a method for analysing Constraint Grammars (CG)

that can detect internal conflicts and redundancies in a given

grammar, without the need for a corpus. The aim is for grammar

writers to be able to automatically diagnose, and then manually

improve their grammars. Our method works by translating the

given grammar into logical constraints that are analysed by a SAT-

solver. We have evaluated our analysis on a number of non-trivial

grammars and found inconsistencies.

Old French Dependency Parsing: Results of Two Parsers Analysed from a Linguistic Point of View

Achim Stein

The treatment of medieval texts is a particular challenge for

parsers. I compare how two dependency parsers, one graph-based,

the other transition-based, perform on Old French, facing some

typical problems of medieval texts: graphical variation, relatively

free word order, and syntactic variation of several parameters

over a diachronic period of about 300 years. Both parsers were

trained and evaluated on the “Syntactic Reference Corpus of

Medieval French” (SRCMF), a manually annotated dependency

treebank. I discuss the relation between types of parsers and types

of language, as well as the differences of the analyses from a

linguistic point of view.

Semi-automatic Parsing for Web Knowledge Extraction through Semantic Annotation

Maria Pia di Buono

Parsing Web information, namely parsing content to find relevant

documents on the basis of a user’s query, represents a crucial

step to guarantee fast and accurate Information Retrieval (IR).

Generally, an automated approach to such task is considered faster

and cheaper than manual systems. Nevertheless, results do not

seem have a high level of accuracy, indeed, as also Hjorland

(2007) states, using stochastic algorithms entails: • Low precision

due to the indexing of common Atomic Linguistic Units (ALUs)

or sentences. • Low recall caused by the presence of synonyms.

• Generic results arising from the use of too broad or too narrow

terms. Usually IR systems are based on an inverted text index, namely

an index data structure storing a mapping from content to its

locations in a database file, or in a document or a set of documents.

In this paper we propose a system, by means of which we will

develop a search engine able to process online documents, starting

from a natural language query, and to return information to users.

The proposed approach, based on the Lexicon-Grammar (LG)

framework and its language formalization methodologies, aims at

integrating a semantic annotation process for both query analysis

and document retrieval.

P07 - Speech Corpora and Databases (1)

Wednesday, May 25, 14:45

Chairperson: Carmen García Mateo Poster Session

Towards Automatic Transcription of ILSE – an Interdisciplinary Longitudinal Study of Adult Development and Aging

Jochen Weiner, Claudia Frankenberg, Dominic Telaar, Britta Wendelstein, Johannes Schröder and Tanja Schultz

The Interdisciplinary Longitudinal Study on Adult Development

and Aging (ILSE) was created to facilitate the study of challenges

posed by rapidly aging societies in developed countries such

as Germany. ILSE contains over 8,000 hours of biographic

interviews recorded from more than 1,000 participants over the

course of 20 years. Investigations on various aspects of aging,

such as cognitive decline, often rely on the analysis of linguistic

features which can be derived from spoken content like these

interviews. However, transcribing speech is a time and cost

consuming manual process and so far only 380 hours of ILSE

interviews have been transcribed. Thus, it is the aim of our work

to establish technical systems to fully automatically transcribe

the ILSE interview data. The joint occurrence of poor recording

quality, long audio segments, erroneous transcriptions, varying

speaking styles & crosstalk, and emotional & dialectal speech

in these interviews presents challenges for automatic speech

recognition (ASR). We describe our ongoing work towards the

fully automatic transcription of all ILSE interviews and the

steps we implemented in preparing the transcriptions to meet the

interviews’ challenges. Using a recursive long audio alignment

procedure 96 hours of the transcribed data have been made

accessible for ASR training.

FABIOLE, a Speech Database for Forensic Speaker Comparison

Moez Ajili, Jean-François Bonastre, Juliette Kahn, Solange Rossato and Guillaume Bernard

A speech database has been collected to highlight the

importance of “speaker factor” in forensic voice comparison.

FABIOLE has been created during the FABIOLE project funded

by the French Research Agency (ANR) from 2013 to 2016. This

corpus consists of more than 3,000 excerpts spoken by 130 native French male speakers. The speakers are divided into two categories: 30 target speakers, each with 100 excerpts, and 100 “impostors”, each with only one excerpt. The

data were collected from 10 different French radio and television

shows, where each utterance turn has a minimum duration of 30 s and good speech quality. The data set is mainly used

for investigating speaker factor in forensic voice comparison and

interpreting some unsolved issues, such as the relationship between speaker characteristics and system behavior. In this paper, we present the FABIOLE database. Then, preliminary experiments are

performed to evaluate the effect of the “speaker factor” and the

show on a voice comparison system behavior.

Phonetic Inventory for an Arabic Speech Corpus

Nawar Halabi and Mike Wald

Corpus design for speech synthesis is a well-researched topic in

languages such as English compared to Modern Standard Arabic,

and there is a tendency to focus on methods to automatically

generate the orthographic transcript to be recorded (usually greedy

methods). In this work, a study of Modern Standard Arabic

(MSA) phonetics and phonology is conducted in order to create

criteria for a greedy method to create a speech corpus transcript

for recording. The size of the dataset is reduced a number of

times using these optimisation methods with different parameters

to yield a much smaller dataset with phonetic coverage identical to that before the reduction, and this output transcript is chosen for


recording. This is part of a larger work to create a completely

annotated and segmented speech corpus for MSA.

AIMU: Actionable Items for Meeting Understanding

Yun-Nung Chen and Dilek Hakkani-Tur

With emerging conversational data, automated content analysis

is needed for better data interpretation, so that it is accurately

understood and can be effectively integrated and utilized in

various applications. ICSI meeting corpus is a publicly released

data set of multi-party meetings in an organization that has been

released over a decade ago, and has been fostering meeting

understanding research since then. The original data collection

includes transcription of participant turns as well as meta-data

annotations, such as disfluencies and dialog act tags. This paper

presents an extended set of annotations for the ICSI meeting

corpus with a goal of deeply understanding meeting conversations,

where participant turns are annotated by actionable items that

could be performed by an automated meeting assistant. In addition

to the user utterances that contain an actionable item, annotations

also include the arguments associated with the actionable item.

The set of actionable items is determined by aligning human-

human interactions to human-machine interactions, where a

data annotation schema designed for a virtual personal assistant

(human-machine genre) is adapted to the meetings domain

(human-human genre). The data set is formed by annotating

participants’ utterances in meetings with potential intents/actions

considering their contexts. The set of actions target what could be

accomplished by an automated meeting assistant, such as taking

a note of action items that a participant commits to, or finding

emails or topic related documents that were mentioned during the

meeting. A total of 10 defined intents/actions are considered as

actionable items in meetings. Turns that include actionable intents

were annotated for 22 public ICSI meetings, which include a total of

21K utterances, segmented by speaker turns. Participants’ spoken

turns, possible actions along with associated arguments and

their vector representations as computed by convolutional deep

structured semantic models are included in the data set for future

research. We present a detailed statistical analysis of the data

set and analyze the performance of applying convolutional deep

structured semantic models for an actionable item detection task.

The data is available at http://research.microsoft.com/projects/meetingunderstanding/.

A Taxonomy of Specific Problem Classes in Text-to-Speech Synthesis: Comparing Commercial and Open Source Performance

Felix Burkhardt and Uwe D. Reichel

Current state-of-the-art speech synthesizers for domain-

independent systems still struggle with the challenge of generating

understandable and natural-sounding speech. This is mainly

because the pronunciation of words of foreign origin, inflections

and compound words often cannot be handled by rules.

Furthermore there are too many of these for inclusion in exception

dictionaries. We describe an approach to evaluating text-to-speech

synthesizers with a subjective listening experiment. The focus

is to differentiate between known problem classes for speech

synthesizers. The target language is German but we believe that

many of the described phenomena are not language specific. We

distinguish the following problem categories: Normalization,

Foreign linguistics, Natural writing, Language specific and

General. Each of them is divided into three to five problem

classes. Word lists for each of the above mentioned categories

were compiled and synthesized by both a commercial and an

open source synthesizer, both being based on the non-uniform

unit-selection approach. The synthesized speech was evaluated by

human judges using the Speechalyzer toolkit and the results are

discussed. The results show that, as expected, the commercial synthesizer performs much better than the open-source one, and that words of foreign origin in particular were pronounced badly by both systems.

A Comparative Analysis of Crowdsourced Natural Language Corpora for Spoken Dialog Systems

Patricia Braunger, Hansjörg Hofmann, Steffen Werner and Maria Schmidt

Recent spoken dialog systems have been able to recognize freely

spoken user input in restricted domains thanks to statistical

methods in the automatic speech recognition. These methods

require a high number of natural language utterances to train

the speech recognition engine and to assess the quality of the

system. Since human speech offers many variants associated

with a single intent, a high number of user utterances have to

be elicited. Developers are therefore turning to crowdsourcing to

collect this data. This paper compares three different methods to

elicit multiple utterances for given semantics via crowdsourcing,

namely with pictures, with text and with semantic entities.

Specifically, we compare the methods with regard to the amount of valid data and the linguistic variance, for which a quantitative and

qualitative approach is proposed. In our study, the method with

text led to a high variance in the utterances and a relatively low

rate of invalid data.

A Singing Voice Database in Basque for Statistical Singing Synthesis of Bertsolaritza

Xabier Sarasola, Eva Navas, David Tavarez, Daniel Erro, Ibon Saratxaga and Inma Hernaez

This paper describes the characteristics and structure of a Basque

singing voice database of bertsolaritza. Bertsolaritza is a popular


singing style from Basque Country sung exclusively in Basque

that is improvised and a capella. The database is designed to

be used in statistical singing voice synthesis for bertsolaritza

style. Starting from the recordings and transcriptions of numerous

singers, diarization and phoneme alignment experiments have

been made to extract the singing voice from the recordings and

create phoneme alignments. These labelling processes have been

performed applying standard speech processing techniques and

the results prove that these techniques can be used in this specific

singing style.

AMISCO: The Austrian German Multi-Sensor Corpus

Hannes Pessentheiner, Thomas Pichler and Martin Hagmüller

We introduce a unique, comprehensive Austrian German multi-

sensor corpus with moving and non-moving speakers to facilitate

the evaluation of estimators and detectors that jointly detect

a speaker’s spatial and temporal parameters. The corpus is

suitable for various machine learning and signal processing

tasks, linguistic studies, and studies related to a speaker’s

fundamental frequency (due to recorded glottograms). Available

corpora are limited to (synthetically generated/spatialized) speech

data or recordings of musical instruments that lack moving

speakers, glottograms, and/or multi-channel distant speech

recordings. That is why we recorded 24 spatially non-moving

and moving speakers, balanced male and female, to set up a

two-room and 43-channel Austrian German multi-sensor speech

corpus. It contains 8.2 hours of read speech based on

phonetically balanced sentences, commands, and digits. The

orthographic transcriptions include around 53,000 word tokens

and 2,070 word types. Special features of this corpus are

the laryngograph recordings (representing glottograms required

to detect a speaker’s instantaneous fundamental frequency

and pitch), corresponding clean-speech recordings, and spatial

information and video data provided by four Kinects and a

camera.

A Database of Laryngeal High-Speed Videos with Simultaneous High-Quality Audio Recordings of Pathological and Non-Pathological Voices

Philipp Aichinger, Immer Roesner, Matthias Leonhard, Doris-Maria Denk-Linnert, Wolfgang Bigenzahn and Berit Schneider-Stickler

Auditory voice quality judgements are used intensively for

the clinical assessment of pathological voice. Voice quality

concepts are fuzzily defined and poorly standardized however,

which hinders scientific and clinical communication. The

described database documents a wide variety of pathologies

and is used to investigate auditory voice quality concepts with

regard to phonation mechanisms. The database contains 375

laryngeal high-speed videos and simultaneous high-quality audio

recordings of sustained phonations of 80 pathological and 40

non-pathological subjects. Interval-wise annotations regarding

video and audio quality, as well as voice quality ratings are

provided. Video quality is annotated for the visibility of

anatomical structures and artefacts such as blurring or reduced

contrast. Voice quality annotations include ratings on the presence

of dysphonia and diplophonia. The purpose of the database is

to aid the formulation of observationally well-founded models

of phonation and the development of model-based automatic

detectors for distinct types of phonation, especially for clinically

relevant nonmodal voice phenomena. Another application is

the training of audio-based fundamental frequency extractors on

video-based reference fundamental frequencies.

BulPhonC: Bulgarian Speech Corpus for the Development of ASR Technology

Neli Hateva, Petar Mitankin and Stoyan Mihov

In this paper we introduce a Bulgarian speech database, which

was created for the purpose of ASR technology development. The

paper describes the design and the content of the speech database.

We present also an empirical evaluation of the performance of a

LVCSR system for Bulgarian trained on the BulPhonC data. The

resource is available free for scientific usage.

Designing a Speech Corpus for the Development and Evaluation of Dictation Systems in Latvian

Marcis Pinnis, Askars Salimbajevs and Ilze Auzina

In this paper the authors present a speech corpus designed

and created for the development and evaluation of dictation

systems in Latvian. The corpus consists of over nine hours of

orthographically annotated speech from 30 different speakers.

The corpus features spoken commands that are common for

dictation systems for text editors. The corpus is evaluated in an

automatic speech recognition scenario. Evaluation results in an

ASR dictation scenario show that the addition of the corpus to the

acoustic model training data in combination with language model

adaptation allows the WER to be decreased by up to 41.36% relative (16.83% absolute) compared to a baseline system without language model adaptation. The contribution of acoustic data augmentation is 12.57% relative (3.43% absolute).
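To clarify how the relative and absolute reductions quoted above relate to each other, the short sketch below back-computes the implied baseline WER from the two reported figures; the baseline value is an inference for illustration, not a number reported by the authors.

```python
# Minimal sketch: relating absolute and relative WER reductions.
# The baseline WER is *inferred* from the reported figures (16.83 points
# absolute ~ 41.36% relative), so treat it as an approximation only.

absolute_reduction = 16.83          # WER points removed by the improved system
relative_reduction = 0.4136         # 41.36% of the baseline WER

baseline_wer = absolute_reduction / relative_reduction   # ~40.69%
improved_wer = baseline_wer - absolute_reduction         # ~23.86%

print(f"implied baseline WER : {baseline_wer:.2f}%")
print(f"implied improved WER : {improved_wer:.2f}%")
print(f"relative reduction   : {100 * absolute_reduction / baseline_wer:.2f}%")
```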

The LetsRead Corpus of Portuguese Children Reading Aloud for Performance Evaluation

Jorge Proença, Dirce Celorico, Sara Candeias, Carla Lopes and Fernando Perdigão

This paper introduces the LetsRead Corpus of European

Portuguese read speech from 6 to 10 years old children.


The motivation for the creation of this corpus stems from

the absence of databases with recordings of reading tasks

of Portuguese children with different performance levels and

including all the common reading aloud disfluencies. It is

also essential to develop techniques to fulfill the main objective

of the LetsRead project: to automatically evaluate the reading

performance of children through the analysis of reading tasks.

The collected data amounts to 20 hours of speech from 284

children from private and public Portuguese schools, with each

child carrying out two tasks: reading sentences and reading a list

of pseudowords, both with varying levels of difficulty throughout

the school grades. In this paper, the design of the reading

tasks presented to children is described, as well as the collection

procedure. Manually annotated data is analyzed according to

disfluencies and reading performance. The considered word

difficulty parameter is also confirmed to be suitable for the

pseudoword reading tasks.

The BAS Speech Data Repository

Uwe Reichel, Florian Schiel, Thomas Kisler, Christoph Draxler and Nina Pörner

The BAS CLARIN speech data repository is introduced. At the

current state it comprises 31 pre-dominantly German corpora of

spoken language. It is compliant to the CLARIN-D as well

as the OLAC requirements. This enables its embedding into

several infrastructures. We give an overview of its structure,

its implementation as well as the corpora it contains.

A Dutch Dysarthric Speech Database for Individualized Speech Therapy Research

Emre Yilmaz, Mario Ganzeboom, Lilian Beijer, Catia Cucchiarini and Helmer Strik

We present a new Dutch dysarthric speech database containing

utterances of neurological patients with Parkinson’s disease,

traumatic brain injury and cerebrovascular accident. The speech

content is phonetically and linguistically diversified by using

numerous structured sentence and word lists. Containing

more than 6 hours of mildly to moderately dysarthric speech,

this database can be used for research on dysarthria and

for developing and testing speech-to-text systems designed for

medical applications. Current activities aimed at extending this

database are also discussed.

P08 - Summarisation, Wednesday, May 25, 14:45

Chairperson: Gerard de Melo – Poster Session

Urdu Summary Corpus

Muhammad Humayoun, Rao Muhammad Adeel Nawab, Muhammad Uzair, Saba Aslam and Omer Farzand

Language resources, such as corpora, are important for various

natural language processing tasks. Urdu has millions of speakers

around the world but it is under-resourced in terms of standard

evaluation resources. This paper reports the construction of a

benchmark corpus for Urdu summaries (abstracts) to facilitate the

development and evaluation of single document summarization

systems for the Urdu language. In Urdu, a space does not always mark a word boundary. Therefore, we created two versions of the

same corpus. In the first version, words are separated by space.

In contrast, proper word boundaries are manually tagged in the

second version. We further apply normalization, part-of-speech

tagging, morphological analysis, lemmatization, and stemming

for the articles and their summaries in both versions. In order to

apply these annotations, we re-implemented some NLP tools for

Urdu. We provide the Urdu Summary Corpus, all these annotations and the needed software tools (as open source) for researchers to run experiments and to evaluate their work, including but not limited to the single-document summarization task.

A Publicly Available Indonesian Corpora for Automatic Abstractive and Extractive Chat Summarization

Fajri Koto

In this paper we report our effort to construct the first ever

Indonesian corpora for chat summarization. Specifically, we

utilized documents of multi-participant chat from a well known

online instant messaging application, WhatsApp. We construct

the gold standard by asking three native speakers to manually

summarize 300 chat sections (152 of them contain images).

As a result, three reference summaries in extractive and abstractive form are produced for each chat section. The corpus is still in its early stage of investigation, yielding exciting possibilities for future work.

Revisiting Summarization Evaluation for Scientific Articles

Arman Cohan and Nazli Goharian

Evaluation of text summarization approaches has been mostly

based on metrics that measure similarities of system generated

summaries with a set of human written gold-standard summaries.


The most widely used metric in summarization evaluation has

been the ROUGE family. ROUGE solely relies on lexical overlaps

between the terms and phrases in the sentences; therefore, in

cases of terminology variations and paraphrasing, ROUGE is

not as effective. Scientific article summarization is one such

case that is different from general domain summarization (e.g.

newswire data). We provide an extensive analysis of ROUGE’s

effectiveness as an evaluation metric for scientific summarization;

we show that, contrary to the common belief, ROUGE is not

very reliable in evaluating scientific summaries. We furthermore

show how different variants of ROUGE result in very different

correlations with the manual Pyramid scores. Finally, we propose

an alternative metric for summarization evaluation which is based

on the content relevance between a system generated summary

and the corresponding human written summaries. We call our

metric SERA (Summarization Evaluation by Relevance Analysis).

Unlike ROUGE, SERA consistently achieves high correlations

with manual scores which shows its effectiveness in evaluation

of scientific article summarization.
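As a rough illustration of content-relevance-based scoring (a simplified stand-in for the idea, not the authors' SERA metric), one can issue the system summary and a reference summary as queries against a document collection and compare the retrieved result sets:

```python
# Simplified content-relevance score (illustrative only, not SERA itself):
# retrieve documents with each summary as a query and compare the result sets.

corpus = {
    "d1": "neural summarization of scientific articles",
    "d2": "evaluation metrics for machine translation",
    "d3": "citation based summarization of research papers",
}

def retrieve(query, k=2):
    """Rank documents by simple token overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(q & set(corpus[d].split())))
    return set(ranked[:k])

def relevance_score(system_summary, reference_summary, k=2):
    """Overlap (Jaccard) of the two retrieved result sets."""
    a, b = retrieve(system_summary, k), retrieve(reference_summary, k)
    return len(a & b) / len(a | b)

print(relevance_score("summarization of scientific papers",
                      "summarizing research articles"))
```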

The OnForumS corpus from the Shared Task on Online Forum Summarisation at MultiLing 2015

Mijail Kabadjov, Udo Kruschwitz, Massimo Poesio, Josef Steinberger, Jorge Valderrama and Hugo Zaragoza

In this paper we present the OnForumS corpus developed for the

shared task of the same name on Online Forum Summarisation

(OnForumS at MultiLing’15). The corpus consists of a set

of news articles with associated readers’ comments from The

Guardian (English) and La Repubblica (Italian). It comes

with four levels of annotation: argument structure, comment-

article linking, sentiment and coreference. The former three

were produced through crowdsourcing, whereas the latter, by an

experienced annotator using a mature annotation scheme. Given

its annotation breadth, we believe the corpus will prove a useful

resource in stimulating and furthering research in the areas of

Argumentation Mining, Summarisation, Sentiment, Coreference

and the interlinks therein.

P09 - Word Sense Disambiguation (1), Wednesday, May 25, 14:45

Chairperson: Luca Dini – Poster Session

Automatic Enrichment of WordNet with Common-Sense Knowledge

Luigi Di Caro and Guido Boella

WordNet represents a cornerstone in the Computational

Linguistics field, linking words to meanings (or senses) through a

taxonomical representation of synsets, i.e., clusters of words with

an equivalent meaning in a specific context often described by

few definitions (or glosses) and examples. Most of the approaches

to the Word Sense Disambiguation task fully rely on these short

texts as a source of contextual information to match with the

input text to disambiguate. This paper presents the first attempt to

enrich synsets data with common-sense definitions, automatically

retrieved from ConceptNet 5, and disambiguated according to

WordNet. The aim was to exploit the shared- and immediate-

thinking nature of common-sense knowledge to extend the short

but incredibly useful contextual information of the synsets. A

manual evaluation on a subset of the entire result (which counts

a total of almost 600K synset enrichments) shows a very high

precision with an estimated good recall.

VPS-GradeUp: Graded Decisions on Usage Patterns

Vít Baisa, Silvie Cinkova, Ema Krejcová and Anna Vernerová

We present VPS-GradeUp – a set of 11,400 graded human

decisions on usage patterns of 29 English lexical verbs from

the Pattern Dictionary of English Verbs by Patrick Hanks.

The annotation contains, for each verb lemma, a batch of 50

concordances with the given lemma as KWIC, and for each

of these concordances we provide a graded human decision

on how well the individual PDEV patterns for this particular

lemma illustrate the given concordance, indicated on a 7-point

Likert scale for each PDEV pattern. With our annotation, we

were pursuing a pilot investigation of the foundations of human

clustering and disambiguation decisions with respect to usage

patterns of verbs in context. The data set is publicly available

at http://hdl.handle.net/11234/1-1585.

Sense-annotating a Lexical Substitution Data Set with Ubyline

Tristan Miller, Mohamed Khemakhem, Richard Eckart de Castilho and Iryna Gurevych

We describe the construction of GLASS, a newly sense-

annotated version of the German lexical substitution data set

used at the GermEval 2015: LexSub shared task. Using the

two annotation layers, we conduct the first known empirical

study of the relationship between manually applied word

senses and lexical substitutions. We find that synonymy and

hypernymy/hyponymy are the only semantic relations directly

linking targets to their substitutes, and that substitutes in the

target’s hypernymy/hyponymy taxonomy closely align with the

synonyms of a single GermaNet synset. Despite this, these

substitutes account for a minority of those provided by the


annotators. The results of our analysis accord with those

of a previous study on English-language data (albeit with

automatically induced word senses), leading us to suspect that the

sense–substitution relations we discovered may be of a universal

nature. We also tentatively conclude that relatively cheap lexical

substitution annotations can be used as a knowledge source for

automatic WSD. Also introduced in this paper is Ubyline, the

web application used to produce the sense annotations. Ubyline

presents an intuitive user interface optimized for annotating lexical

sample data, and is readily adaptable to sense inventories other

than GermaNet.

A Corpus of Literal and Idiomatic Uses of German Infinitive-Verb Compounds

Andrea Horbach, Andrea Hensler, Sabine Krome, Jakob Prange, Werner Scholze-Stubenrecht, Diana Steffen, Stefan Thater, Christian Wellner and Manfred Pinkal

We present an annotation study on a representative dataset of

literal and idiomatic uses of German infinitive-verb compounds

in newspaper and journal texts. Infinitive-verb compounds form

a challenge for writers of German, because spelling regulations

are different for literal and idiomatic uses. Through the

participation of expert lexicographers we were able to obtain a

high-quality corpus resource which offers itself as a testbed for

automatic idiomaticity detection and coarse-grained word-sense

disambiguation. We trained a classifier on the corpus which was

able to distinguish literal and idiomatic uses with an accuracy of

85 %.

The SemDaX Corpus – Sense Annotations with Scalable Sense Inventories

Bolette Pedersen, Anna Braasch, Anders Johannsen, Héctor Martínez Alonso, Sanni Nimb, Sussi Olsen, Anders Søgaard and Nicolai Hartvig Sørensen

We launch the SemDaX corpus which is a recently completed

Danish human-annotated corpus available through a CLARIN

academic license. The corpus includes approx. 90,000 words,

comprises six textual domains, and is annotated with sense

inventories of different granularity. The aim of the developed

corpus is twofold: i) to assess the reliability of the different sense

annotation schemes for Danish measured by qualitative analyses

and annotation agreement scores, and ii) to serve as training

and test data for machine learning algorithms with the practical

purpose of developing sense taggers for Danish. To these aims,

we take a new approach to human-annotated corpus resources by

double annotating a much larger part of the corpus than what is

normally seen: for the all-words task we double annotated 60%

of the material and for the lexical sample task 100%. We include

in the corpus not only the adjudicated files, but also the diverging

annotations. In other words, we consider not all disagreement

to be noise, but rather to contain valuable linguistic information

that can help us improve our annotation schemes and our learning

algorithms.

Graded and Word-Sense-Disambiguation Decisions in Corpus Pattern Analysis: a Pilot Study

Silvie Cinkova, Ema Krejcová, Anna Vernerová and Vít Baisa

We present a pilot analysis of a new linguistic resource, VPS-GradeUp (available at http://hdl.handle.net/11234/1-1585). The resource contains 11,400 graded human decisions

on usage patterns of 29 English lexical verbs, randomly selected

from the Pattern Dictionary of English Verbs (Hanks, 2000–2014) based on their frequency and the number of senses their

lemmas have in PDEV. This data set has been created to observe

the interannotator agreement on PDEV patterns produced using

the Corpus Pattern Analysis (Hanks, 2013). Apart from the

graded decisions, the data set also contains traditional Word-

Sense-Disambiguation (WSD) labels. We analyze the associations

between the graded annotation and WSD annotation. The results

of the respective annotations do not correlate with the size of the

usage pattern inventory for the respective verb lemmas, which

makes the data set worth further linguistic analysis.

Multi-prototype Chinese Character Embedding

Yanan Lu, Yue Zhang and Donghong Ji

Chinese sentences are written as sequences of characters, which

are elementary units of syntax and semantics. Characters are

highly polysemous in forming words. We present a position-

sensitive skip-gram model to learn multi-prototype Chinese

character embeddings, and explore the usefulness of such

character embeddings to Chinese NLP tasks. Evaluation on

character similarity shows that multi-prototype embeddings are

significantly better than a single-prototype baseline. In addition,

used as features in the Chinese NER task, the embeddings result

in a 1.74% F-score improvement over a state-of-the-art baseline.

A comparison of Named-Entity Disambiguation and Word Sense Disambiguation

Angel Chang, Valentin I. Spitkovsky, Christopher D. Manning and Eneko Agirre

Named Entity Disambiguation (NED) is the task of linking

a named-entity mention to an instance in a knowledge-base,

typically Wikipedia-derived resources like DBpedia. This task

is closely related to word-sense disambiguation (WSD), where

the mention of an open-class word is linked to a concept in a

knowledge-base, typically WordNet. This paper analyzes the


relation between two annotated datasets on NED and WSD,

highlighting the commonalities and differences. We detail the

methods to construct a NED system following the WSD word-

expert approach, where we need a dictionary and one classifier

is built for each target entity mention string. Constructing a

dictionary for NED proved challenging, and although similarity

and ambiguity are higher for NED, the results are also higher

due to the larger amount of training data, and the crisper and more skewed meaning differences.

O9 - Linked Data, Wednesday, May 25, 16:45

Chairperson: John McCrae – Oral Session

Leveraging RDF Graphs for Crossing Multiple Bilingual Dictionaries

Marta Villegas, Maite Melero, Núria Bel and Jorge Gracia

The experiments presented here exploit the properties of

the Apertium RDF Graph, principally cycle density and

nodes’ degree, to automatically generate new translation

relations between words, and therefore to enrich existing

bilingual dictionaries with new entries. Currently, the

Apertium RDF Graph includes data from 22 Apertium bilingual

dictionaries and constitutes a large unified array of linked

lexical entries and translations that are available and accessible

on the Web (http://linguistic.linkeddata.es/apertium/). In particular, its graph structure allows

for interesting exploitation opportunities, some of which are

addressed in this paper. Two ’massive’ experiments are reported:

in the first one, the original EN-ES translation set was removed

from the Apertium RDF Graph and a new EN-ES version was

generated. The results were compared against the previously

removed EN-ES data and against the Concise Oxford Spanish

Dictionary. In the second experiment, a new non-existent EN-

FR translation set was generated. In this case the results were

compared against a converted wiktionary English-French file. The

results we obtained are very good, even in the extreme case of correlated polysemy. This led us to consider the possibility of using cycles and node degrees to identify potential oddities in the source data. If cycle density proves effective when considering potential targets, we can assume that in dense graphs nodes with a low degree may indicate potential errors.
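The enrichment idea can be illustrated with a much simpler pivot heuristic over a graph of bilingual dictionary entries (a hedged sketch with toy data, not the Apertium RDF Graph procedure): an EN-ES pair is proposed when an English word and a Spanish word share a translation in a third language.

```python
# Hedged sketch of pivot-based translation inference over bilingual
# dictionaries (toy data; not the actual Apertium RDF graph procedure).
from collections import defaultdict

# Existing translation pairs, keyed by language pair.
dictionaries = {
    ("en", "ca"): {("house", "casa"), ("dog", "gos")},
    ("ca", "es"): {("casa", "casa"), ("gos", "perro")},
}

def infer(src, pivot, tgt):
    """Propose src->tgt pairs that share a pivot-language word."""
    via = defaultdict(set)
    for s, p in dictionaries[(src, pivot)]:
        via[p].add(s)
    proposed = set()
    for p, t in dictionaries[(pivot, tgt)]:
        for s in via.get(p, ()):
            proposed.add((s, t))
    return proposed

print(infer("en", "ca", "es"))   # -> {('house', 'casa'), ('dog', 'perro')}
```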

PreMOn: a Lemon Extension for Exposing Predicate Models as Linked Data

Francesco Corcoglioniti, Marco Rospocher, Alessio Palmero Aprosio and Sara Tonelli

We introduce PreMOn (predicate model for ontologies), a

linguistic resource for exposing predicate models (PropBank,

NomBank, VerbNet, and FrameNet) and mappings between

them (e.g, SemLink) as Linked Open Data. It consists of two

components: (i) the PreMOn Ontology, an extension of the

lemon model by the W3C Ontology-Lexica Community Group,

that makes it possible to homogeneously represent data from the various

predicate models; and, (ii) the PreMOn Dataset, a collection of

RDF datasets integrating various versions of the aforementioned

predicate models and mapping resources. PreMOn is freely

available and accessible online in different ways, including

through a dedicated SPARQL endpoint.

Semantic Links for Portuguese

Fabricio Chalub, Livy Real, Alexandre Rademaker and Valeria de Paiva

This paper describes work on incorporating Princeton WordNet's morphosemantic links into the fabric of the Portuguese

OpenWordNet-PT. Morphosemantic links are relations between

verbs and derivationally related nouns that are semantically typed

(such as for tune-tuner – in Portuguese “afinar-afinador” – linked

through an “agent” link). Morphosemantic links have been

discussed for Princeton’s WordNet for a while, but have not been

added to the official database. These links are very useful, as they

help us to improve our Portuguese WordNet. Thus we discuss the

integration of these links in our base and the issues we encountered

with the integration.

Creating Linked Data Morphological Language Resources with MMoOn - The Hebrew Morpheme Inventory

Bettina Klimek, Natanael Arndt, Sebastian Krause and Timotheus Arndt

The development of standard models for describing general lexical

resources has led to the emergence of numerous lexical datasets

of various languages in the Semantic Web. However, equivalent

models covering the linguistic domain of morphology do not

exist. As a result, there are hardly any language resources of

morphemic data available in RDF to date. This paper presents

the creation of the Hebrew Morpheme Inventory from a manually

compiled tabular dataset comprising around 52,000 entries. It is an ongoing effort to represent the lexemes, word-forms and morphological patterns together with their underlying relations


based on the newly created Multilingual Morpheme Ontology

(MMoOn). It will be shown how segmented Hebrew language

data can be granularly described in a Linked Data format, thus,

serving as an exemplary case for creating morpheme inventories

of any inflectional language with MMoOn. The resulting dataset

is described a) according to the structure of the underlying data

format, b) with respect to the Hebrew language characteristic

of building word-forms directly from roots, c) by exemplifying

how inflectional information is realized and d) with regard to its

enrichment with external links to sense resources.

O10 - Multilingual Corpora, Wednesday, May 25, 16:45

Chairperson: Hitoshi Isahara – Oral Session

Parallel Global Voices: a Collection of Multilingual Corpora with Citizen Media Stories

Prokopis Prokopidis, Vassilis Papavassiliou and Stelios Piperidis

We present a new collection of multilingual corpora automatically

created from the content available in the Global Voices websites,

where volunteers have been posting and translating citizen media

stories since 2004. We describe how we crawled and processed

this content to generate parallel resources comprising 302.6K

document pairs and 8.36M segment alignments in 756 language

pairs. For some language pairs, the segment alignments in this

resource are the first open examples of their kind. In an initial use

of this resource, we discuss how a set of document pair detection

algorithms performs on the Greek-English corpus.

Large Multi-lingual, Multi-level and Multi-genre Annotation Corpus

Xuansong Li, Martha Palmer, Nianwen Xue, Lance Ramshaw, Mohamed Maamouri, Ann Bies, Kathryn Conger, Stephen Grimes and Stephanie Strassel

High accuracy for automated translation and information retrieval

calls for linguistic annotations at various language levels.

The plethora of informal internet content sparked the demand

for porting state-of-art natural language processing (NLP)

applications to new social media as well as diverse language

adaptation. Effort launched by the BOLT (Broad Operational

Language Translation) program at DARPA (Defense Advanced

Research Projects Agency) successfully addressed the internet

information with enhanced NLP systems. BOLT aims for

automated translation and linguistic analysis for informal genres

of text and speech in online and in-person communication.

As a part of this program, the Linguistic Data Consortium

(LDC) developed valuable linguistic resources in support of

the training and evaluation of such new technologies. This

paper focuses on methodologies, infrastructure, and procedure

for developing linguistic annotation at various language levels,

including Treebank (TB), word alignment (WA), PropBank (PB),

and co-reference (CoRef). Inspired by the OntoNotes approach

with adaptations to the tasks to reflect the goals and scope of the

BOLT project, this effort has introduced more annotation types of

informal and free-style genres in English, Chinese and Egyptian

Arabic. The corpus produced is by far the largest multi-lingual,

multi-level and multi-genre annotation corpus of informal text and

speech.

C4Corpus: Multilingual Web-size Corpus with Free License

Ivan Habernal, Omnia Zayed and Iryna Gurevych

Large Web corpora containing full documents with permissive

licenses are crucial for many NLP tasks. In this article we present

the construction of a 12-million-page Web corpus (over 10 billion

tokens) licensed under CreativeCommons license family in 50+

languages that has been extracted from CommonCrawl, the largest

publicly available general Web crawl to date with about 2 billion

crawled URLs. Our highly-scalable Hadoop-based framework is

able to process the full CommonCrawl corpus on a 2000+ CPU cluster on the Amazon Elastic Map/Reduce infrastructure. The

processing pipeline includes license identification, state-of-the-art

boilerplate removal, exact duplicate and near-duplicate document

removal, and language detection. The construction of the corpus

is highly configurable and fully reproducible, and we provide both

the framework (DKPro C4CorpusTools) and the resulting data

(C4Corpus) to the research community.
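One of the pipeline steps above, near-duplicate document removal, can be sketched with a simple shingle-based Jaccard similarity check (an illustrative simplification; the actual C4CorpusTools implementation may use a different algorithm):

```python
# Illustrative near-duplicate check using word 3-gram "shingles" and
# Jaccard similarity; the real C4Corpus pipeline may differ.

def shingles(text, n=3):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(docs, threshold=0.8):
    """Return index pairs of documents whose shingle sets largely overlap."""
    sigs = [shingles(d) for d in docs]
    return [(i, j)
            for i in range(len(docs))
            for j in range(i + 1, len(docs))
            if jaccard(sigs[i], sigs[j]) >= threshold]

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog today",
    "a completely different sentence about web corpora",
]
print(near_duplicates(docs))   # -> [(0, 1)]
```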

OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles

Pierre Lison and Jörg Tiedemann

We present a new major release of the OpenSubtitles collection of

parallel corpora. The release is compiled from a large database

of movie and TV subtitles and includes a total of 1689 bitexts

spanning 2.6 billion sentences across 60 languages. The release

also incorporates a number of enhancements in the preprocessing

and alignment of the subtitles, such as the automatic correction

of OCR errors and the use of meta-data to estimate the quality of

each subtitle and score subtitle pairs.


O11 - Lexicons, Wednesday, May 25, 16:45

Chairperson: Bolette Pedersen – Oral Session

LexFr: Adapting the LexIt Framework to Build a Corpus-based French Subcategorization Lexicon

Giulia Rambelli, Gianluca Lebani, Laurent Prévot and Alessandro Lenci

This paper introduces LexFr, a corpus-based French lexical

resource built by adapting the framework LexIt, originally

developed to describe the combinatorial potential of Italian

predicates. As in the original framework, the behavior of a

group of target predicates is characterized by a series of syntactic

(i.e., subcategorization frames) and semantic (i.e., selectional

preferences) statistical information (a.k.a. distributional profiles)

whose extraction process is mostly unsupervised. The first

release of LexFr includes information for 2,493 verbs, 7,939

nouns and 2,628 adjectives. In these pages we describe the

adaptation process and evaluate the final resource by comparing

the information collected for 20 test verbs against the information

available in a gold standard dictionary. In the best performing

setting, we obtained 0.74 precision, 0.66 recall and 0.70 F-

measure.

Polarity Lexicon Building: to what Extent Is the Manual Effort Worth?

Iñaki San Vicente and Xabier Saralegi

Polarity lexicons are a basic resource for analyzing the sentiments

and opinions expressed in texts in an automated way. This

paper explores three methods to construct polarity lexicons:

translating existing lexicons from other languages, extracting

polarity lexicons from corpora, and annotating Lexical Knowledge Bases with sentiment. Each of these methods requires a different degree of human effort. We evaluate how much manual effort is

needed and to what extent that effort pays in terms of performance

improvement. Experiment setup includes generating lexicons for

Basque, and evaluating them against gold standard datasets in

different domains. Results show that extracting polarity lexicons

from corpora is the best solution for achieving a good performance

with reasonable human effort.

Al Qamus al Muhit, a Medieval Arabic Lexicon in LMF

Ouafae Nahli, Francesca Frontini, Monica Monachini, Fahad Khan, Arsalan Zarghili and Mustapha Khalfi

This paper describes the conversion into LMF, a standard lexicographic digital format, of 'al-Qamus al-Muhit', a Medieval

Arabic lexicon. The lexicon is first described, then all the steps

required for the conversion are illustrated. The work will

produce a useful lexicographic resource for Arabic NLP, but is

also interesting per se, to study the implications of adapting the

LMF model to the Arabic language. Some reflections are offered

as to the status of roots with respect to previously suggested

representations. In particular, roots are, in our opinion are to be

not treated as lexical entries, but modeled as lexical metadata for

classifying and identifying lexical entries. In this manner, each

root connects all entries that are derived from it.

CASSAurus: A Resource of Simpler Spanish Synonyms

Ricardo Baeza-Yates, Luz Rello and Julia Dembowski

In this work we introduce and describe a language resource

composed of lists of simpler synonyms for Spanish. The

synonyms are divided in different senses taken from the Spanish

OpenThesaurus, where context disambiguation was performed by

using statistical information from the Web and Google Books

Ngrams. This resource is freely available online and can be used

for different NLP tasks such as lexical simplification. Indeed, so

far it has been already integrated into four tools.

O12 - OCR for Historical Text, Wednesday, May 25, 16:45

Chairperson: Thierry Declerck – Oral Session

Measuring Lexical Quality of a Historical Finnish Newspaper Collection – Analysis of Garbled OCR Data with Basic Language Technology Tools and Means

Kimmo Kettunen and Tuula Pääkkönen

The National Library of Finland has digitized a large proportion

of the historical newspapers published in Finland between 1771

and 1910 (Bremer-Laamanen 2001). This collection contains

approximately 1.95 million pages in Finnish and Swedish.

The Finnish part of the collection consists of about 2.39 billion

words. The National Library’s Digital Collections are offered

via the digi.kansalliskirjasto.fi web service, also known as Digi.

Part of this material is also freely downloadable from The Language Bank of Finland, provided by the FIN-CLARIN consortium. The collection can also be accessed through the

Korp environment that has been developed by Språkbanken at the

University of Gothenburg and extended by FIN-CLARIN team

at the University of Helsinki to provide concordances of text

resources. A Cranfield-style information retrieval test collection

has been produced out of a small part of the Digi newspaper

material at the University of Tampere (Järvelin et al., 2015). The


quality of the OCRed collections is an important topic in digital

humanities, as it affects general usability and searchability of

collections. There is no single available method to assess the

quality of large collections, but different methods can be used

to approximate the quality. This paper discusses different corpus-analysis methods for approximating the overall lexical quality of

the Finnish part of the Digi collection.

Using SMT for OCR Error Correction of Historical Texts

Haithem Afli, Zhengwei Qiu, Andy Way and Páraic Sheridan

A trend to digitize historical paper-based archives has emerged

in recent years, with the advent of digital optical scanners. A

lot of paper-based books, textbooks, magazines, articles, and

documents are being transformed into electronic versions that

can be manipulated by a computer. For this purpose, Optical

Character Recognition (OCR) systems have been developed

to transform scanned digital text into editable computer text.

However, different kinds of errors can be found in the OCR system output text, and Automatic Error Correction tools can help improve the quality of electronic texts by cleaning them and removing noise. In this paper, we perform a qualitative and

quantitative comparison of several error-correction techniques for

historical French documents. Experimentation shows that our

Machine Translation for Error Correction method is superior to

other Language Modelling correction techniques, with nearly 13%

relative improvement compared to the initial baseline.

OCR Post-Correction Evaluation of Early Dutch Books Online - Revisited

Martin Reynaert

We present further work on evaluation of the fully automatic

post-correction of Early Dutch Books Online, a collection of

10,333 18th century books. In prior work we evaluated the new

implementation of Text-Induced Corpus Clean-up (TICCL) on the

basis of a single book Gold Standard derived from this collection.

In the current paper we revisit the same collection on the basis of a

sizeable 1020 item random sample of OCR post-corrected strings

from the full collection. Both evaluations have their own stories

to tell and lessons to teach.

Crowdsourcing an OCR Gold Standard for a German and French Heritage Corpus

Simon Clematide, Lenz Furrer and Martin Volk

Crowdsourcing approaches for post-correction of OCR output

(Optical Character Recognition) have been successfully applied

to several historic text collections. We report on our crowd-

correction platform Kokos, which we built to improve the OCR

quality of the digitized yearbooks of the Swiss Alpine Club

(SAC) from the 19th century. This multilingual heritage corpus

consists of Alpine texts mainly written in German and French,

all typeset in Antiqua font. Finding and engaging volunteers for

correcting large amounts of pages into high quality text requires

a carefully designed user interface, an easy-to-use workflow, and

continuous efforts for keeping the participants motivated. More

than 180,000 characters on about 21,000 pages were corrected by

volunteers in about 7 months, achieving an OCR gold standard with

a systematically evaluated accuracy of 99.7% on the word level.

The crowdsourced OCR gold standard and the corresponding

original OCR recognition results from Abby FineReader 7 for

each page are available as a resource. Additionally, the scanned

images (300dpi) of all pages are included in order to facilitate tests

with other OCR software.

P10 - Discourse (1), Wednesday, May 25, 16:45

Chairperson: Elena Cabrio – Poster Session

Argument Mining: the Bottleneck of Knowledge and Language Resources

Patrick Saint-Dizier

Given a controversial issue, argument mining from natural

language texts (news papers, and any form of text on the

Internet) is extremely challenging: domain knowledge is often

required together with appropriate forms of inferences to identify

arguments. This contribution explores the types of knowledge that

are required and how they can be paired with reasoning schemes,

language processing and language resources to accurately mine

arguments. We show via corpus analysis that the Generative

Lexicon, enhanced in different manners and viewed as both a

lexicon and a domain knowledge representation, is a relevant

approach. In this paper, corpus annotation for argument mining

is first developed, then we show how the generative lexicon

approach must be adapted and how it can be paired with language

processing patterns to extract and specify the nature of arguments.

Our approach to argument mining is thus knowledge-driven.

From Interoperable Annotations towards Interoperable Resources: A Multilingual Approach to the Analysis of Discourse

Ekaterina Lapshinova-Koltunski, Kerstin Anna Kunz and Anna Nedoluzhko

In the present paper, we analyse variation of discourse phenomena

in two typologically different languages, i.e. in German and


Czech. The novelty of our approach lies in the nature of

the resources we are using. Advantage is taken of existing

resources, which are, however, annotated on the basis of two

different frameworks. We use an interoperable scheme unifying

discourse phenomena in both frameworks into more abstract

categories and considering only those phenomena that have a

direct match in German and Czech. The discourse properties

we focus on are relations of identity, semantic similarity, ellipsis

and discourse relations. Our study shows that the application of

interoperable schemes allows an exploitation of discourse-related

phenomena analysed in different projects and on the basis of

different frameworks. As corpus compilation and annotation is

a time-consuming task, positive results of this experiment open up

new paths for contrastive linguistics, translation studies and NLP,

including machine translation.

Falling silent, lost for words ... Tracing personal involvement in interviews with Dutch war veterans

Henk van den Heuvel and Nelleke Oostdijk

In sources used in oral history research (such as interviews

with eye witnesses), passages where the degree of personal

emotional involvement is found to be high can be of particular

interest, as these may give insight into how historical events

were experienced, and what moral dilemmas and psychological

or religious struggles were encountered. In a pilot study involving

a large corpus of interview recordings with Dutch war veterans,

we have investigated if it is possible to develop a method for

automatically identifying those passages where the degree of

personal emotional involvement is high. The method is based

on the automatic detection of exceptionally large silences and

filled pause segments (using Automatic Speech Recognition), and

cues taken from specific n-grams. The first results appear to be

encouraging enough for further elaboration of the method.

A Bilingual Discourse Corpus and Its Applications

Yang Liu, Jiajun Zhang, Chengqing Zong, Yating Yang and Xi Zhou

Existing discourse research only focuses on the monolingual

languages and the inconsistency between languages limits the

power of the discourse theory in multilingual applications such as

machine translation. To address this issue, we design and build a

bilingual discourse corpus in which we are currently defining and

annotating the bilingual elementary discourse units (BEDUs). The

BEDUs are then organized into hierarchical structures. Using this

discourse style, we have annotated nearly 20K LDC sentences.

Finally, we design a bilingual discourse based method for machine

translation evaluation and show the effectiveness of our bilingual

discourse annotations.

Adding Semantic Relations to a Large-Coverage Connective Lexicon of German

Tatjana Scheffler and Manfred Stede

DiMLex is a lexicon of German connectives that can be used

for various language understanding purposes. We enhanced the

coverage to 275 connectives, which we regard as covering all

known German discourse connectives in current use. In this

paper, we consider the task of adding the semantic relations

that can be expressed by each connective. After discussing

different approaches to retrieving semantic information, we settle

on annotating each connective with senses from the new PDTB

3.0 sense hierarchy. We describe our new implementation in

the extended DiMLex, which will be available for research

purposes.

Corpus Resources for Dispute Mediation Discourse

Mathilde Janier and Chris Reed

Dispute mediation is a growing activity in the resolution of

conflicts, and more and more research emerge to enhance and

better understand this (until recently) understudied practice.

Corpus analyses are necessary to study discourse in this context;

yet, little data is available, mainly because of its confidentiality

principle. After proposing hints and avenues to acquire transcripts

of mediation sessions, this paper presents the Dispute Mediation

Corpus, which gathers annotated excerpts of mediation dialogues.

Although developed as part of a project on argumentation, it

is freely available and the text data can be used by anyone.

This first-ever open corpus of mediation interactions can be of

interest to scholars studying discourse, but also conflict resolution,

argumentation, linguistics, communication, etc. We advocate

for using and extending this resource that may be valuable to a

large variety of domains of research, particularly those striving

to enhance the study of the rapidly growing activity of dispute

mediation.

A Tagged Corpus for Automatic Labeling of Disabilities in Medical Scientific Papers

Carlos Valmaseda, Juan Martinez-Romo and Lourdes Araujo

This paper presents the creation of a corpus of labeled disabilities

in scientific papers. The identification of medical concepts in

documents and, especially, the identification of disabilities, is a

complex task mainly due to the variety of expressions that can

make reference to the same problem. Currently there is not a set

of documents manually annotated with disabilities with which to

evaluate an automatic detection system for such concepts. This is why this corpus was created, aiming to facilitate the


evaluation of systems that implement an automatic annotation tool

for extracting biomedical concepts such as disabilities. The result

is a set of manually annotated scientific papers. For the selection of these papers, a search was conducted using a list of rare diseases, since these generally have several associated disabilities of different kinds.

PersonaBank: A Corpus of Personal Narratives and Their Story Intention Graphs

Stephanie Lukin, Kevin Bowden, Casey Barackman and Marilyn Walker

We present a new corpus, PersonaBank, consisting of 108 personal

stories from weblogs that have been annotated with their Story

Intention Graphs, a deep representation of the content of a

story. We describe the topics of the stories and the basis of

the Story Intention Graph representation, as well as the process

of annotating the stories to produce the Story Intention Graphs

and the challenges of adapting the tool to this new personal

narrative domain. We also discuss how the corpus can be used in

applications that retell the story using different styles of tellings,

co-tellings, or as a content planner.

Fine-Grained Chinese Discourse Relation Labelling

Huan-Yuan Chen, Wan-Shan Liao, Hen-Hsen Huang and Hsin-Hsi Chen

This paper explores several aspects together for a fine-grained

Chinese discourse analysis. We deal with the issues of ambiguous

discourse markers, ambiguous marker linkings, and more than

one discourse marker. A universal feature representation is

proposed. The pair-once postulation, cross-discourse-unit-first

rule and word-pair-marker-first rule select a set of discourse

markers from ambiguous linkings. Marker-Sum feature considers

total contribution of markers and Marker-Preference feature

captures the probability distribution of discourse functions of a

representative marker by using preference rule. The HIT Chinese

discourse relation treebank (HIT-CDTB) is used to evaluate the

proposed models. The 25-way classifier achieves 0.57 micro-

averaged F-score.

Annotating Discourse Relations in Spoken Language: A Comparison of the PDTB and CCR Frameworks

Ines Rehbein, Merel Scholman and Vera Demberg

In discourse relation annotation, there is currently a variety of

different frameworks being used, and most of them have been

developed and employed mostly on written data. This raises

a number of questions regarding interoperability of discourse

relation annotation schemes, as well as regarding differences in

discourse annotation for written vs. spoken domains. In this

paper, we describe our work on annotating two spoken domains from the SPICE Ireland corpus (telephone conversations and broadcast interviews) according to two different discourse annotation schemes, PDTB 3.0 and CCR. We show that annotations in the two schemes can largely be mapped onto one another, and discuss differences in

operationalisations of discourse relation schemes which present

a challenge to automatic mapping. We also observe systematic

differences in the prevalence of implicit discourse relations in

spoken data compared to written texts, and find that there are also

differences in the types of causal relations between the domains.

Finally, we find that PDTB 3.0 addresses many shortcomings of

PDTB 2.0 with respect to the annotation of spoken discourse, and suggest further extensions. The new corpus is roughly the size of the CoNLL 2015 Shared Task test set, and we hence hope that it will be

a valuable resource for the evaluation of automatic discourse

relation labellers.

Enhancing The RATP-DECODA Corpus With Linguistic Annotations For Performing A Large Range Of NLP Tasks

Carole Lailler, Anaïs Landeau, Frédéric Béchet, Yannick Estève and Paul Deléglise

In this article, we present the RATP-DECODA Corpus

which is composed by a set of 67 hours of speech from

telephone conversations of a Customer Care Service (CCS).

This corpus is already available online at http://sldr.org/sldr000847/fr in its first version. However, many

enhancements have been made in order to allow the development

of automatic techniques to transcript conversations and to capture

their meaning. These enhancements fall into two categories:

firstly, we have increased the size of the corpus with manual

transcriptions from a new operational day; secondly we have

added new linguistic annotations to the whole corpus (either

manually or through an automatic processing) in order to perform

various linguistic tasks from syntactic and semantic parsing to

dialog act tagging and dialog summarization.

Parallel Discourse Annotations on a Corpus of Short Texts

Manfred Stede, Stergos Afantenos, Andreas Peldszus, Nicholas Asher and Jérémy Perret

We present the first corpus of texts annotated with two

alternative approaches to discourse structure, Rhetorical Structure

Theory (Mann and Thompson, 1988) and Segmented Discourse

Representation Theory (Asher and Lascarides, 2003). 112

short argumentative texts have been analyzed according to these


two theories. Furthermore, in previous work, the same texts

have already been annotated for their argumentation structure,

according to the scheme of Peldszus and Stede (2013). This

corpus therefore enables studies of correlations between the

two accounts of discourse structure, and between discourse and

argumentation. We converted the three annotation formats to

a common dependency tree format that enables to compare the

structures, and we describe some initial findings.

An Annotated Corpus of Direct Speech

John Lee and Chak Yan Yeung

We propose a scheme for annotating direct speech in

literary texts, based on the Text Encoding Initiative (TEI)

and the coreference annotation guidelines from the Message

Understanding Conference (MUC). The scheme encodes the

speakers and listeners of utterances in a text, as well as the

quotative verbs that report the utterances. We measure inter-

annotator agreement on this annotation task. We then present

statistics on a manually annotated corpus that consists of books

from the New Testament. Finally, we visualize the corpus as a

conversational network.

P11 - Morphology (1), Wednesday, May 25, 16:45

Chairperson: Éric de la Clergerie (Poster Session)

Evaluating the Noisy Channel Model for the Normalization of Historical Texts: Basque, Spanish and Slovene

Izaskun Etxeberria, Iñaki Alegria, Larraitz Uria and Mans Hulden

This paper presents a method for the normalization of historical

texts using a combination of weighted finite-state transducers and

language models. We have extended our previous work on the

normalization of dialectal texts and tested the method against a

17th century literary work in Basque. This preprocessed corpus

is made available in the LREC repository. The performance

of this method for learning relations between historical and

contemporary word forms is evaluated against resources in three

languages. The method we present learns to map phonological

changes using a noisy channel model. The model is based

on techniques commonly used for phonological inference and

producing Grapheme-to-Grapheme conversion systems encoded

as weighted transducers and produces F-scores above 80% in the

task for Basque. A wider evaluation shows that the approach

performs equally well with all the languages in our evaluation

suite: Basque, Spanish and Slovene. A comparison against other

methods that address the same task is also provided.
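
As a rough illustration of the noisy channel decision rule mentioned above (not the authors' transducer-based implementation), the sketch below scores candidate modernizations of a historical spelling by combining a channel probability P(historical | contemporary) with a language model probability P(contemporary); the word forms and probabilities are invented toy values.

    import math

    # Toy log-probabilities; a real system would use a weighted finite-state
    # transducer for the channel model and an n-gram language model.
    channel_logprob = {("iaquin", "jakin"): -1.2, ("iaquin", "joan"): -6.0}
    lm_logprob = {"jakin": -3.1, "joan": -2.8}

    def normalize(historical, candidates):
        """Return the candidate maximizing P(historical|candidate) * P(candidate)."""
        def score(c):
            return (channel_logprob.get((historical, c), -math.inf)
                    + lm_logprob.get(c, -math.inf))
        return max(candidates, key=score)

    print(normalize("iaquin", ["jakin", "joan"]))  # -> jakin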

Farasa: A New Fast and Accurate Arabic Word Segmenter

Kareem Darwish and Hamdy Mubarak

In this paper, we present Farasa (meaning insight in Arabic),

which is a fast and accurate Arabic segmenter. Segmentation

involves breaking Arabic words into their constituent clitics.

Our approach is based on SVMrank using linear kernels. The

features that we utilized account for: likelihood of stems,

prefixes, suffixes, and their combination; presence in lexicons

containing valid stems and named entities; and underlying stem

templates. Farasa outperforms or matches state-of-the-art Arabic

segmenters, namely QATARA and MADAMIRA. Meanwhile,

Farasa is nearly one order of magnitude faster than QATARA

and two orders of magnitude faster than MADAMIRA. The

segmenter should be able to process one billion words in less than

5 hours. Farasa is written entirely in native Java, with no external

dependencies, and is open-source.
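
The ranking formulation can be pictured as scoring candidate clitic segmentations with a linear model over lexicon and affix features, which is what a trained SVM-rank model does at prediction time. The sketch below is purely illustrative: the features, weights, toy lexicon and Buckwalter-style tokens are invented, not taken from Farasa.

    # Invented weights; Farasa learns its feature weights with SVM-rank.
    weights = {"stem_in_lexicon": 2.0, "known_prefix": 1.0,
               "known_suffix": 1.0, "num_clitics": -0.3}
    stem_lexicon = {"ktAb", "ktb"}                     # toy stem lexicon
    prefixes, suffixes = {"w+", "Al+"}, {"+hm", "+h"}  # toy clitic lists

    def features(seg):
        stem = next(t for t in seg if not (t.endswith("+") or t.startswith("+")))
        return {"stem_in_lexicon": float(stem in stem_lexicon),
                "known_prefix": float(any(t in prefixes for t in seg)),
                "known_suffix": float(any(t in suffixes for t in seg)),
                "num_clitics": float(len(seg) - 1)}

    def best_segmentation(candidates):
        return max(candidates, key=lambda seg: sum(
            weights[name] * value for name, value in features(seg).items()))

    # Toy candidates for the word "wktAbhm" ("and their book").
    print(best_segmentation([["w+", "ktAb", "+hm"], ["wktAbhm"]]))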

A Morphological Lexicon of Esperanto with Morpheme Frequencies

Eckhard Bick

This paper discusses the internal structure of complex Esperanto

words (CWs). Using a morphological analyzer, possible affixation

and compounding is checked for over 50,000 Esperanto lexemes

against a list of 17,000 root words. Morpheme boundaries

in the resulting analyses were then checked manually, creating

a CW dictionary of 28,000 words, representing 56.4% of the

lexicon, or 19.4% of corpus tokens. The error percentage of

the EspGram morphological analyzer for new corpus CWs was

4.3% for types and 6.4% for tokens, with a recall of almost

100%, and wrong/spurious boundaries being more common

than missing ones. For pedagogical purposes a morpheme

frequency dictionary was constructed for a 16 million word

corpus, confirming the importance of agglutinative derivational

morphemes in the Esperanto lexicon. Finally, as a means to reduce

the morphological ambiguity of CWs, we provide POS likelihoods

for Esperanto suffixes.

How does Dictionary Size Influence Performance of Vietnamese Word Segmentation?

Wuying Liu and Lin Wang

Vietnamese word segmentation (VWS) is a challenging basic

issue for natural language processing. This paper addresses the

problem of how dictionary size influences VWS performance,

proposes two novel measures: square overlap ratio (SOR)


and relaxed square overlap ratio (RSOR), and validates their

effectiveness. The SOR measure is the product of dictionary

overlap ratio and corpus overlap ratio, and the RSOR measure

is the relaxed version of SOR measure under an unsupervised

condition. Both measures indicate how well a segmentation dictionary suits the corpus to be segmented. The experimental results show that a dictionary of suitable size, neither too small nor too large, is what achieves state-of-the-art performance for dictionary-based Vietnamese word segmenters.
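
The SOR measure itself is simple to compute once the two overlap ratios are defined. The sketch below is one plausible reading only: it assumes the dictionary overlap ratio is the share of dictionary entries attested in the corpus and the corpus overlap ratio is the share of corpus word types covered by the dictionary; the paper's exact definitions may differ, and the Vietnamese items are toy examples.

    def square_overlap_ratio(dictionary, corpus_types):
        """SOR = dictionary overlap ratio * corpus overlap ratio (assumed reading)."""
        dictionary, corpus_types = set(dictionary), set(corpus_types)
        overlap = dictionary & corpus_types
        return (len(overlap) / len(dictionary)) * (len(overlap) / len(corpus_types))

    print(square_overlap_ratio({"học sinh", "giáo viên", "nhà"},
                               {"học sinh", "nhà", "trường", "lớp"}))  # 2/3 * 2/4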

Giving Lexical Resources a Second Life: Démonette, a Multi-sourced Morpho-semantic Network for French

Nabil Hathout and Fiammetta Namer

Démonette is a derivational morphological network designed for

the description of French. Its original architecture enables its

use as a formal framework for the description of morphological

analyses and as a repository for existing lexicons. It is fed with

a variety of resources, all of which have already been validated. The

harmonization of their content into a unified format provides them

a second life, in which they are enriched with new properties,

provided these are deducible from their contents. Démonette

is released under a Creative Commons license. It is usable for

theoretical and descriptive research in morphology, as a source

of experimental material for psycholinguistics, natural language

processing (NLP) and information retrieval (IR), where it fills a

gap, since French lacks a large-coverage derivational resources

database. The article presents the integration of two existing

lexicons into Démonette. The first is Verbaction, a lexicon of

deverbal action nouns. The second is Lexeur, a database of agent

nouns in -eur derived from verbs or from nouns.

Syntactic Analysis of Phrasal Compounds in Corpora: a Challenge for NLP Tools

Carola Trips

The paper introduces a “train once, use many” approach for the

syntactic analysis of phrasal compounds (PC) of the type XP+N

like “Would you like to sit on my knee?” nonsense. PCs are a

challenge for NLP tools since they require the identification of a

syntactic phrase within a morphological complex. We propose

a method which uses a state-of-the-art dependency parser not

only to analyse sentences (the environment of PCs) but also

to compound the non-head of PCs in a well-defined particular

condition which is the analysis of the non-head spanning from

the left boundary (mostly marked by a determiner) to the nominal

head of the PC. This method contains the following steps: (a)

the use of an English state-of-the-art dependency parser with data

comprising sentences with PCs from the British National Corpus

(BNC), (b) the detection of parsing errors of PCs, (c) the separate

treatment of the non-head structure using the same model, and

(d) the attachment of the non-head to the compound head. The

evaluation of the method showed that the accuracy of 76% could

be improved by adding a step in the PC compounder module

which specifies user-defined contexts sensitive to the part

of speech of the non-head parts and by using TreeTagger, in line

with our approach.

DALILA: The Dialectal Arabic Linguistic Learning Assistant

Salam Khalifa, Houda Bouamor and Nizar Habash

Dialectal Arabic (DA) poses serious challenges for Natural

Language Processing (NLP). The number and sophistication of

tools and datasets in DA are very limited in comparison to

Modern Standard Arabic (MSA) and other languages. MSA

tools do not effectively model DA which makes the direct use

of MSA NLP tools for handling dialects impractical. This

is particularly a challenge for the creation of tools to support

learning Arabic as a living language on the web, where authentic

material can be found in both MSA and DA. In this paper,

we present the Dialectal Arabic Linguistic Learning Assistant

(DALILA), a Chrome extension that utilizes cutting-edge Arabic

dialect NLP research to assist learners and non-native speakers

in understanding text written in either MSA or DA. DALILA

provides dialectal word analysis and English gloss corresponding

to each word.

Refurbishing a Morphological Database for German

Petra Steiner

The CELEX database is one of the standard lexical resources for

German. It yields a wealth of data especially for phonological and

morphological applications. The morphological part comprises

deep-structure morphological analyses of German. However, as

it was developed in the Nineties, both encoding and spelling

are outdated. About one fifth of over 50,000 datasets contain

umlauts and signs such as ß. Changes to a modern version

cannot be obtained by simple substitution. In this paper, we

shortly describe the original content and form of the orthographic

and morphological database for German in CELEX. Then we

present our work on modernizing the linguistic data. Lemmas

and morphological analyses are transferred to a modern standard

of encoding by first merging orthographic and morphological

information of the lemmas and their entries and then performing

a second substitution for the morphs within their morphological

analyses. Changes to modern German spelling are performed

by substitution rules according to orthographical standards. We

show an example of the use of the data for the disambiguation of


morphological structures. The discussion describes prospects of

future work on this or similar lexicons. The Perl script is publicly

available on our website.

P12 - Sentiment Analysis and Opinion Mining (1), Wednesday, May 25, 16:45

Chairperson: German Rigau (Poster Session)

Encoding Adjective Scales for Fine-grained Resources

Cédric Lopez, Frederique Segond and Christiane Fellbaum

We propose an automatic approach towards determining the

relative location of adjectives on a common scale based on their

strength. We focus on adjectives expressing different degrees

of goodness occurring in French product (perfumes) reviews.

Using morphosyntactic patterns, we extract from the reviews short

phrases consisting of a noun that encodes a particular aspect of

the perfume and an adjective modifying that noun. We then

associate each such n-gram with the corresponding product aspect

and its related star rating. Next, based on the star scores, we

generate adjective scales reflecting the relative strength of specific

adjectives associated with a shared attribute of the product. An

automatic ordering of the adjectives “correct” (correct), “sympa”

(nice), “bon” (good) and “excellent” (excellent) according to their

score in our resource is consistent with an intuitive scale based on

human judgments. Our long-term objective is to generate different

adjective scales in an empirical manner, which could allow the

enrichment of lexical resources.
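
As a schematic illustration of the scale-building step (not the authors' full pattern-based pipeline), the following sketch averages the star ratings observed with each adjective and sorts the adjectives by that mean; the (adjective, rating) pairs are invented.

    from collections import defaultdict

    # Invented (adjective, star rating) observations extracted from reviews.
    observations = [("correct", 3), ("bon", 4), ("sympa", 3.5),
                    ("excellent", 5), ("bon", 4.5), ("correct", 2.5)]

    ratings = defaultdict(list)
    for adjective, stars in observations:
        ratings[adjective].append(stars)

    # Order adjectives by mean star rating to obtain a strength scale.
    scale = sorted(ratings, key=lambda adj: sum(ratings[adj]) / len(ratings[adj]))
    print(scale)  # -> ['correct', 'sympa', 'bon', 'excellent']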

SCARE – The Sentiment Corpus of App Reviews with Fine-grained Annotations in German

Mario Sänger, Ulf Leser, Steffen Kemmerer, Peter Adolphs and Roman Klinger

The automatic analysis of texts containing opinions of users about,

e.g., products or political views has gained attention within the

last decades. However, previous work on the task of analyzing

user reviews about mobile applications in app stores is limited.

Publicly available corpora do not exist, such that a comparison

of different methods and models is difficult. We fill this gap by

contributing the Sentiment Corpus of App Reviews (SCARE),

which contains fine-grained annotations of application aspects,

subjective (evaluative) phrases and relations between both. This

corpus consists of 1,760 annotated application reviews from

the Google Play Store with 2,487 aspects and 3,959 subjective

phrases. We describe the process and methodology used to create the corpus. The Fleiss Kappa between four annotators

reveals an agreement of 0.72. We provide a strong baseline

with a linear-chain conditional random field and word-embedding

features with a performance of 0.62 for aspect detection and 0.63

for the extraction of subjective phrases. The corpus is available to

the research community to support the development of sentiment

analysis methods on mobile application reviews.

Datasets for Aspect-Based Sentiment Analysis in French

Marianna Apidianaki, Xavier Tannier and Cécile Richart

Aspect Based Sentiment Analysis (ABSA) is the task of mining

and summarizing opinions from text about specific entities

and their aspects. This article describes two datasets for the

development and testing of ABSA systems for French which

comprise user reviews annotated with relevant entities, aspects and

polarity values. The first dataset contains 457 restaurant reviews

(2365 sentences) for training and testing ABSA systems, while the

second contains 162 museum reviews (655 sentences) dedicated

to out-of-domain evaluation. Both datasets were built as part of

SemEval-2016 Task 5 “Aspect-Based Sentiment Analysis”, where

seven different languages were represented, and are publicly

available for research purposes.

ANEW+: Automatic Expansion and Validation of Affective Norms of Words Lexicons in Multiple Languages

Samira Shaikh, Kit Cho, Tomek Strzalkowski, Laurie Feldman, John Lien, Ting Liu and George Aaron Broadwell

In this article we describe our method of automatically expanding

an existing lexicon of words with affective valence scores. The

automatic expansion process was done in English. In addition,

we describe our procedure for automatically creating lexicons in

languages where such resources may not previously exist. The

foreign languages we discuss in this paper are Spanish, Russian

and Farsi. We also describe the procedures to systematically

validate our newly created resources. The main contributions of

this work are: 1) A general method for expansion and creation of

lexicons with scores of words on psychological constructs such as

valence, arousal or dominance; and 2) a procedure for ensuring

validity of the newly constructed resources.

PotTS: The Potsdam Twitter Sentiment Corpus

Uladzimir Sidarenka

In this paper, we introduce a novel comprehensive dataset of 7,992

German tweets, which were manually annotated by two human

experts with fine-grained opinion relations. A rich annotation

scheme used for this corpus includes such sentiment-relevant

elements as opinion spans, their respective sources and targets,

emotionally laden terms with their possible contextual negations


and modifiers. Various inter-annotator agreement studies, which

were carried out at different stages of work on these data (at the

initial training phase, upon an adjudication step, and after the

final annotation run), reveal that labeling evaluative judgements

in microblogs is an inherently difficult task even for professional

coders. These difficulties, however, can be alleviated by letting

the annotators revise each other’s decisions. Once rechecked,

the experts can proceed with the annotation of further messages,

staying at a fairly high level of agreement.

Challenges of Evaluating Sentiment Analysis Tools on Social Media

Diana Maynard and Kalina Bontcheva

This paper discusses the challenges in carrying out fair

comparative evaluations of sentiment analysis systems. Firstly,

these are due to differences in corpus annotation guidelines

and sentiment class distribution. Secondly, different systems

often make different assumptions about how to interpret certain

statements, e.g. tweets with URLs. In order to study the impact of

these on evaluation results, this paper focuses on tweet sentiment

analysis in particular. One existing and two newly created corpora

are used, and the performance of four different sentiment analysis

systems is reported; we make our annotated datasets and sentiment

analysis applications publicly available. We see considerable

variations in results across the different corpora, which calls

into question the validity of many existing annotated datasets

and evaluations, and we make some observations about both the

systems and the datasets as a result.

EmoTweet-28: A Fine-Grained Emotion Corpus for Sentiment Analysis

Jasy Suet Yan Liew, Howard R. Turtle and Elizabeth D. Liddy

This paper describes EmoTweet-28, a carefully curated corpus

of 15,553 tweets annotated with 28 emotion categories for the

purpose of training and evaluating machine learning models for

emotion classification. EmoTweet-28 is, to date, the largest

tweet corpus annotated with fine-grained emotion categories.

The corpus contains annotations for four facets of emotion:

valence, arousal, emotion category and emotion cues. We first

used small-scale content analysis to inductively identify a set of

emotion categories that characterize the emotions expressed in

microblog text. We then expanded the size of the corpus using

crowdsourcing. The corpus encompasses a variety of examples

including explicit and implicit expressions of emotions as well as

tweets containing multiple emotions. EmoTweet-28 represents an

important resource to advance the development and evaluation of

more emotion-sensitive systems.

Happy Accident: A Sentiment Composition Lexicon for Opposing Polarity Phrases

Svetlana Kiritchenko and Saif Mohammad

Sentiment composition is the determination of the sentiment of a multi-

word linguistic unit, such as a phrase or a sentence, based on

its constituents. We focus on sentiment composition in phrases

formed by at least one positive and at least one negative word —

phrases like ’happy accident’ and ’best winter break’. We refer to

such phrases as opposing polarity phrases. We manually annotate

a collection of opposing polarity phrases and their constituent

single words with real-valued sentiment intensity scores using a

method known as Best–Worst Scaling. We show that the obtained

annotations are consistent. We explore the entries in the lexicon

for linguistic regularities that govern sentiment composition in

opposing polarity phrases. Finally, we list the current and possible

future applications of the lexicon.
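
Best–Worst Scaling annotations are typically converted to real-valued scores by simple counting: each item's score is the proportion of times it was chosen as best minus the proportion of times it was chosen as worst. The sketch below illustrates that generic counting procedure on invented judgements; it is not the authors' code, and the exact aggregation used for this lexicon may differ.

    from collections import Counter

    # Invented 4-tuples: each trial records the item judged most positive
    # (best) and the item judged most negative (worst).
    trials = [
        {"items": ["happy accident", "best winter break", "sad loss", "fine"],
         "best": "best winter break", "worst": "sad loss"},
        {"items": ["happy accident", "sad loss", "fine", "great pain"],
         "best": "fine", "worst": "great pain"},
    ]

    appearances, best, worst = Counter(), Counter(), Counter()
    for trial in trials:
        appearances.update(trial["items"])
        best[trial["best"]] += 1
        worst[trial["worst"]] += 1

    # Counting-based score in [-1, 1] for each annotated phrase.
    scores = {item: (best[item] - worst[item]) / count
              for item, count in appearances.items()}
    print(scores)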

Detecting Implicit Expressions of Affect from Text using Semantic Knowledge on Common Concept Properties

Alexandra Balahur and Hristo Tanev

Emotions are an important part of the human experience. They are

responsible for the adaptation and integration in the environment,

offering, most of the time together with the cognitive system,

the appropriate responses to stimuli in the environment. As

such, they are an important component in decision-making

processes. In today’s society, the avalanche of stimuli present in

the environment (physical or virtual) makes people more prone to

respond to stronger affective stimuli (i.e., those that are related

to their basic needs and motivations – survival, food, shelter,

etc.). In media reporting, this translates into the use of arguments

(factual data) that are known to trigger specific (strong, affective)

behavioural reactions from the readers. This paper describes

initial efforts to detect such arguments from text, based on the

properties of concepts. The final system able to retrieve and

label this type of data from the news in traditional and social

platforms is intended to be integrated into the Europe Media Monitor

family of applications to detect texts that trigger certain (especially

negative) reactions from the public, with consequences on citizen

safety and security.

Creating a General Russian Sentiment Lexicon

Natalia Loukachevitch and Anatolii Levchik

The paper describes the new Russian sentiment lexicon -

RuSentiLex. The lexicon was gathered from several sources:


opinionated words from domain-oriented Russian sentiment

vocabularies, slang and curse words extracted from Twitter,

objective words with positive or negative connotations from a

news collection. The words in the lexicon having different

sentiment orientations in specific senses are linked to appropriate

concepts of the thesaurus of Russian language RuThes. All

lexicon entries are classified according to four sentiment

categories and three sources of sentiment (opinion, emotion,

or fact). The lexicon can serve as the first version for the

construction of domain-specific sentiment lexicons or can be used

for feature generation in machine-learning approaches. In this

role, the RuSentiLex lexicon was utilized by the participants of

the SentiRuEval-2016 Twitter reputation monitoring shared task

and allowed them to achieve high results.

GRaSP: A Multilayered Annotation Scheme for Perspectives

Chantal van Son, Tommaso Caselli, Antske Fokkens, Isa Maks, Roser Morante, Lora Aroyo and Piek Vossen

This paper presents a framework and methodology for the

annotation of perspectives in text. In the last decade, different

aspects of linguistic encoding of perspectives have been targeted

as separated phenomena through different annotation initiatives.

We propose an annotation scheme that integrates these different

phenomena. We use a multilayered annotation approach, splitting

the annotation of different aspects of perspectives into small

subsequent subtasks in order to reduce the complexity of the

task and to better monitor interactions between layers. Currently,

we have included four layers of perspective annotation: events,

attribution, factuality and opinion. The annotations are integrated

in a formal model called GRaSP, which provides the means to

represent instances (e.g. events, entities) and propositions in the

(real or assumed) world in relation to their mentions in text. Then,

the relation between the source and target of a perspective is

characterized by means of perspective annotations. This enables

us to place alternative perspectives on the same entity, event or

proposition next to each other.

Integration of Lexical and Semantic Knowledge for Sentiment Analysis in SMS

Wejdene Khiari, Mathieu Roche and Asma Bouhafs Hafsia

With the explosive growth of online social media (forums, blogs,

and social networks), exploitation of these new information

sources has become essential. Our work is based on the

sud4science project. The goal of this project is to perform

multidisciplinary work on a corpus of authentic SMS, in French,

collected in 2011 and anonymised (88milSMS corpus: http://88milsms.huma-num.fr). This paper highlights a new

method to integrate opinion detection knowledge from an SMS

corpus by combining lexical and semantic information. More

precisely, our approach gives more weight to words with a

sentiment (i.e. presence of words in a dedicated dictionary) for

a classification task based on three classes: positive, negative,

and neutral. The experiments were conducted on two corpora: an

elongated SMS corpus (i.e. repetitions of characters in messages)

and a non-elongated SMS corpus. We noted that non-elongated

SMS were much better classified than elongated SMS. Overall,

this study highlighted that the integration of semantic knowledge

always improves classification.

Specialising Paragraph Vectors for Text Polarity Detection

Fabio Tamburini

This paper presents some experiments for specialising Paragraph

Vectors, a new technique for creating text fragment (phrase,

sentence, paragraph, text, ...) embedding vectors, for text polarity

detection. The first extension regards the injection of polarity

information extracted from a polarity lexicon into embeddings and

the second extension aims at inserting word order information

into Paragraph Vectors. These two extensions, when training a

logistic-regression classifier on the combined embeddings, were

able to produce a relevant gain in performance when compared

to the standard Paragraph Vector methods proposed by Le and

Mikolov (2014).

Evaluating Lexical Similarity to build Sentiment Similarity

Grégoire Jadi, Vincent Claveau, Béatrice Daille and Laura Monceaux

In this article, we propose to evaluate the lexical similarity

information provided by word representations against several

opinion resources using traditional Information Retrieval tools.

Word representations have been used to build and extend opinion resources such as lexicons and ontologies, and their performance has been evaluated on sentiment analysis tasks.

We question this method by measuring the correlation between

the sentiment proximity provided by opinion resources and

the semantic similarity provided by word representations using

different correlation coefficients. We also compare the neighbors

found in word representations and list of similar opinion words.

Our results show that the proximity of words in state-of-the-

art word representations is not very effective for building sentiment

similarity.
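
The core measurement, correlating sentiment proximity from opinion resources with embedding-based similarity for the same word pairs, can be sketched as below; the paired scores are invented, and Pearson and Spearman are only examples of the "different correlation coefficients" mentioned in the abstract.

    from scipy.stats import pearsonr, spearmanr

    # Invented paired scores for the same word pairs: sentiment proximity from
    # an opinion resource vs. cosine similarity from word representations.
    sentiment_proximity  = [0.9, 0.1, 0.7, 0.3, 0.8]
    embedding_similarity = [0.6, 0.4, 0.5, 0.2, 0.7]

    print("Pearson: ", pearsonr(sentiment_proximity, embedding_similarity)[0])
    print("Spearman:", spearmanr(sentiment_proximity, embedding_similarity)[0])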


P13 - Semantics (1), Wednesday, May 25, 16:45

Chairperson: Christian Chiarcos (Poster Session)

Visualisation and Exploration of High-Dimensional Distributional Features in Lexical Semantic Classification

Maximilian Köper, Melanie Zaiß, Qi Han, Steffen Koch and Sabine Schulte im Walde

Vector space models and distributional information are widely

used in NLP. The models typically rely on complex, high-

dimensional objects. We present an interactive visualisation tool

to explore salient lexical-semantic features of high-dimensional

word objects and word similarities. Most visualisation tools

provide only one low-dimensional map of the underlying data, so

they are not capable of retaining the local and the global structure.

We overcome this limitation by providing an additional trust-view

to obtain a more realistic picture of the actual object distances.

Additional tool options include the reference to a gold standard

classification, the reference to a cluster analysis as well as listing

the most salient (common) features for a selected subset of the

words.

SemAligner: A Method and Tool for Aligning Chunks with Semantic Relation Types and Semantic Similarity Scores

Nabin Maharjan, Rajendra Banjade, Nobal Bikram Niraula and Vasile Rus

This paper introduces a rule-based method and software tool,

called SemAligner, for aligning chunks across texts in a given

pair of short English texts. The tool, based on the top

performing method at the Interpretable Short Text Similarity

shared task at SemEval 2015, where it was used with human

annotated (gold) chunks, can now additionally process plain text

pairs using two powerful chunkers we developed, e.g. using

Conditional Random Fields. Besides aligning chunks, the tool

automatically assigns semantic relations to the aligned chunks

(such as EQUI for equivalent and OPPO for opposite) and

semantic similarity scores that measure the strength of the

semantic relation between the aligned chunks. Experiments show

that SemAligner performs competitively for system generated

chunks and that these results are also comparable to results

obtained on gold chunks. SemAligner has other capabilities

such as handling various input formats and chunkers as well as

extending lookup resources.

Aspectual Flexibility Increases with Agentivity and Concreteness: A Computational Classification Experiment on Polysemous Verbs

Ingrid Falk and Fabienne Martin

We present an experimental study making use of a machine

learning approach to identify the factors that affect the aspectual

value that characterizes verbs under each of their readings. The

study is based on various morpho-syntactic and semantic features

collected from a French lexical resource and on a gold standard

aspectual classification of verb readings designed by an expert.

Our results support the tested hypothesis, namely that agentivity

and abstractness influence lexical aspect.

mwetoolkit+sem: Integrating Word Embeddings in the mwetoolkit for Semantic MWE Processing

Silvio Cordeiro, Carlos Ramisch and Aline Villavicencio

This paper presents mwetoolkit+sem: an extension of the

mwetoolkit that estimates semantic compositionality scores for

multiword expressions (MWEs) based on word embeddings.

First, we describe our implementation of vector-space operations

working on distributional vectors. The compositionality score is

based on the cosine distance between the MWE vector and the

composition of the vectors of its member words. Our generic

system can handle several types of word embeddings and MWE

lists, and may combine individual word representations using

several composition techniques. We evaluate our implementation

on a dataset of 1042 English noun compounds, comparing

different configurations of the underlying word embeddings and

word-composition models. We show that our vector-based scores

model non-compositionality better than standard association

measures such as log-likelihood.
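
The compositionality score described above, the cosine between an MWE's own vector and the composition of its member-word vectors, can be sketched as follows; the three-dimensional vectors are toy placeholders for real distributional embeddings, and additive composition is only one of the composition techniques the toolkit supports.

    import numpy as np

    def compositionality(mwe_vec, member_vecs, compose=np.sum):
        """Cosine between the MWE vector and the composed member-word vector."""
        composed = compose(np.stack(member_vecs), axis=0)
        return float(np.dot(mwe_vec, composed)
                     / (np.linalg.norm(mwe_vec) * np.linalg.norm(composed)))

    # Toy 3-d embeddings standing in for real distributional vectors.
    couch_potato = np.array([0.1, 0.9, 0.2])
    couch, potato = np.array([0.8, 0.1, 0.1]), np.array([0.7, 0.2, 0.0])
    print(compositionality(couch_potato, [couch, potato]))  # low -> idiomatic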

Cognitively Motivated Distributional Representations of Meaning

Elias Iosif, Spiros Georgiladakis and Alexandros Potamianos

Although meaning is at the core of human cognition, state-of-

the-art distributional semantic models (DSMs) are often agnostic

to the findings in the area of semantic cognition. In this

work, we present a novel type of DSMs motivated by the dual–

processing cognitive perspective that is triggered by lexico–

semantic activations in the short–term human memory. The

proposed model is shown to perform better than state-of-the-art

models for computing semantic similarity between words. The

fusion of different types of DSMs is also investigated achieving


results that are comparable or better than the state-of-the-art. The

used corpora along with a set of tools, as well as large repositories

of vectorial word representations are made publicly available for

four languages (English, German, Italian, and Greek).

Extending Monolingual Semantic Textual Similarity Task to Multiple Cross-lingual Settings

Yoshihiko Hayashi and Wentao Luo

This paper describes our independent effort for extending the

monolingual semantic textual similarity (STS) task setting to

multiple cross-lingual settings involving English, Japanese, and

Chinese. So far, we have adopted a “monolingual similarity after

translation” strategy to predict the semantic similarity between

a pair of sentences in different languages. With this strategy, a

monolingual similarity method is applied after having (one of) the

target sentences translated into a pivot language. Therefore, this

paper specifically details the required and developed resources to

implement this framework, while presenting our current results

for English-Japanese-Chinese cross-lingual STS tasks that may

exemplify the validity of the framework.

Resources for building applications with Dependency Minimal Recursion Semantics

Ann Copestake, Guy Emerson, Michael Wayne Goodman, Matic Horvat, Alexander Kuhnle and Ewa Muszynska

We describe resources aimed at increasing the usability of the

semantic representations utilized within the DELPH-IN (Deep

Linguistic Processing with HPSG) consortium. We concentrate

in particular on the Dependency Minimal Recursion Semantics

(DMRS) formalism, a graph-based representation designed for

compositional semantic representation with deep grammars. Our

main focus is on English, and specifically English Resource

Semantics (ERS) as used in the English Resource Grammar.

We first give an introduction to ERS and DMRS and a brief

overview of some existing resources and then describe in detail

a new repository which has been developed to simplify the use of

ERS/DMRS. We explain a number of operations on DMRS graphs

which our repository supports, with sketches of the algorithms,

and illustrate how these operations can be exploited in application

building. We believe that this work will aid researchers to exploit

the rich and effective but complex DELPH-IN resources.

Subtask Mining from Search Query Logs for How-Knowledge Acceleration

Chung-Lun Kuo and Hsin-Hsi Chen

How-knowledge is indispensable in daily life, but has relatively

less quantity and poorer quality than what-knowledge in publicly

available knowledge bases. This paper first extracts task-subtask

pairs from wikiHow, then mines linguistic patterns from search

query logs, and finally applies the mined patterns to extract

subtasks to complete given how-to tasks. To evaluate the

proposed methodology, we group tasks and the corresponding

recommended subtasks into pairs, and evaluate the results

automatically and manually. The automatic evaluation shows the

accuracy of 0.4494. We also classify the mined patterns based

on prepositions and find that prepositions like “on”, “to”, and “with” have better performance. The results can be used to

accelerate how-knowledge base construction.

Typology of Adjectives Benchmark for Compositional Distributional Models

Daria Ryzhova, Maria Kyuseva and Denis Paperno

In this paper we present a novel application of compositional

distributional semantic models (CDSMs): prediction of lexical

typology. The paper introduces the notion of typological

closeness, which is a novel rigorous formalization of semantic

similarity based on comparison of multilingual data. Starting

from the Moscow Database of Qualitative Features for adjective

typology, we create four datasets of typological closeness, on

which we test a range of distributional semantic models. We

show that, on the one hand, vector representations of phrases

based on data from one language can be used to predict how

words within the phrase translate into different languages, and,

on the other hand, that typological data can serve as a semantic

benchmark for distributional models. We find that compositional

distributional models, especially parametric ones, perform well

above non-compositional alternatives on the task.

DART: a Dataset of Arguments and their Relations on Twitter

Tom Bosc, Elena Cabrio and Serena Villata

The problem of understanding the stream of messages exchanged

on social media such as Facebook and Twitter is becoming a

major challenge for automated systems. The tremendous amount

of data exchanged on these platforms as well as the specific

form of language adopted by social media users constitute a new

challenging context for existing argument mining techniques. In

this paper, we describe a resource of natural language arguments

called DART (Dataset of Arguments and their Relations on

Twitter) where the complete argument mining pipeline over

Twitter messages is considered: (i) we identify which tweets can

be considered as arguments and which cannot, and (ii) we identify


the relation, i.e., support or attack, linking such tweets to

each other.

O13 - Large Projects and Infrastructures, Wednesday, May 25, 18:10

Chairperson: Walter Daelemans (Oral Session)

Port4NooJ v3.0: Integrated Linguistic Resources for Portuguese NLP

Cristina Mota, Paula Carvalho and Anabela Barreiro

This paper introduces Port4NooJ v3.0, the latest version of the

Portuguese module for NooJ, highlights its main features, and

details its three main new components: (i) a lexicon-grammar

based dictionary of 5,177 human intransitive adjectives, and a set

of local grammars that use the distributional properties of those

adjectives for paraphrasing, (ii) a polarity dictionary with 9,031

entries for sentiment analysis, and (iii) a set of priority dictionaries

and local grammars for named entity recognition. These new

components were derived and/or adapted from publicly available

resources. The Port4NooJ v3.0 resource is innovative in terms of

the specificity of the linguistic knowledge it incorporates. The

dictionary is bilingual Portuguese-English, and the semantico-

syntactic information assigned to each entry validates the

linguistic relation between the terms in both languages. These

characteristics, which cannot be found in any other public resource

for Portuguese, make it a valuable resource for translation and

paraphrasing. The paper presents the current statistics and

describes the different complementary and synergic components

and integration efforts.

Collecting Language Resources for the Latvian e-Government Machine Translation Platform

Roberts Rozis, Andrejs Vasiljevs and Raivis Skadinš

This paper describes corpora collection activity for building large

machine translation systems for Latvian e-Government platform.

We describe requirements for corpora, selection and assessment

of data sources, collection of the public corpora and creation

of new corpora from miscellaneous sources. Methodology,

tools and assessment methods are also presented along with the

results achieved, challenges faced and conclusions made. Several

approaches to address the data scarceness are discussed. We

summarize the volume of obtained corpora and provide quality

metrics of MT systems trained on this data. Resulting MT systems

for English-Latvian, Latvian-English and Latvian-Russian are

integrated in the Latvian e-service portal and are freely available

on the website HUGO.LV. This paper can serve as guidance for

similar activities initiated in other countries, particularly in the

context of European Language Resource Coordination action.

Nederlab: Towards a Single Portal and Research Environment for Diachronic Dutch Text Corpora

Hennie Brugman, Martin Reynaert, Nicoline van der Sijs, René van Stipriaan, Erik Tjong Kim Sang and Antal van den Bosch

The Nederlab project aims to bring together all digitized texts

relevant to the Dutch national heritage, the history of the Dutch

language and culture (circa 800 – present) in one user friendly

and tool enriched open access web interface. This paper describes

Nederlab halfway through the project period and discusses the

collections incorporated, back-office processes, system back-end

as well as the Nederlab Research Portal end-user web application.

O14 - Document Classification and Text Categorisation, Wednesday, May 25, 18:10

Chairperson: Robert Frederking (Oral Session)

A Semi-Supervised Approach for Gender Identification

Juan Soler and Leo Wanner

In most of the research studies on Author Profiling, large

quantities of correctly labeled data are used to train the models.

However, this does not reflect the reality in forensic scenarios:

in practical linguistic forensic investigations, the resources that

are available to profile the author of a text are usually scarce.

To pay tribute to this fact, we implemented a Semi-Supervised

Learning variant of the k nearest neighbors algorithm that uses

small sets of labeled data and a larger amount of unlabeled

data to classify the authors of texts by gender (man vs woman).

We describe the enriched KNN algorithm and show that the

use of unlabeled instances improves the accuracy of our gender

identification model. We also present a feature set that facilitates

the use of a very small number of instances, reaching accuracies

higher than 70% with only 113 instances to train the model. It is

also shown that the algorithm performs well using publicly

available data.
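
One common way to let a k-NN classifier profit from unlabeled instances is a self-training loop: classify the unlabeled pool, absorb high-confidence predictions into the labeled set and retrain. The sketch below shows only that generic idea; it is not the authors' enriched KNN algorithm, and the confidence threshold, feature matrices and class labels are placeholders.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def self_training_knn(X_lab, y_lab, X_unlab, k=3, threshold=0.9, rounds=5):
        """Generic self-training around k-NN (illustrative, not the paper's method).
        X_lab, X_unlab are numpy feature matrices; y_lab holds the known labels."""
        X_lab, y_lab = X_lab.copy(), np.asarray(y_lab).copy()
        for _ in range(rounds):
            if len(X_unlab) == 0:
                break
            clf = KNeighborsClassifier(n_neighbors=k).fit(X_lab, y_lab)
            proba = clf.predict_proba(X_unlab)
            confident = proba.max(axis=1) >= threshold
            if not confident.any():
                break
            # Absorb confident predictions as new (pseudo-)labeled instances.
            X_lab = np.vstack([X_lab, X_unlab[confident]])
            y_lab = np.concatenate(
                [y_lab, clf.classes_[proba[confident].argmax(axis=1)]])
            X_unlab = X_unlab[~confident]
        return KNeighborsClassifier(n_neighbors=k).fit(X_lab, y_lab)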

Ensemble Classification of Grants using LDA-based Features

Yannis Korkontzelos, Beverley Thomas, Makoto Miwa and Sophia Ananiadou

Classifying research grants into useful categories is a vital

task for a funding body to give structure to the portfolio

for analysis, informing strategic planning and decision-making.

Automating this classification process would save time and effort,

providing the accuracy of the classifications is maintained. We


employ five classification models to classify a set of BBSRC-

funded research grants into 21 research topics based on unigrams,

technical terms and Latent Dirichlet Allocation models. To

boost precision, we investigate methods for combining their

predictions into five aggregate classifiers. Evaluation confirmed

that ensemble classification models lead to higher precision. It

was observed that there is not a single best-performing aggregate

method for all research topics. Instead, the best-performing

method for a research topic depends on the number of positive

training instances available for this topic. Subject matter

experts considered the predictions of aggregate models to correct

erroneous or incomplete manual assignments.
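
One simple way to combine the predictions of several base classifiers into an aggregate classifier is majority voting, sketched below; this is only an illustrative combination scheme with invented topic labels, not necessarily one of the five aggregate methods evaluated in the paper.

    from collections import Counter

    def majority_vote(predictions_per_model):
        """Combine per-grant topic predictions from several models by majority vote."""
        return [Counter(votes).most_common(1)[0][0]
                for votes in zip(*predictions_per_model)]

    # Invented predictions of three base classifiers for four grants.
    print(majority_vote([["plant", "microbe", "plant", "animal"],
                         ["plant", "plant",   "plant", "animal"],
                         ["soil",  "microbe", "plant", "animal"]]))
    # -> ['plant', 'microbe', 'plant', 'animal']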

Edit Categories and Editor Role Identification in Wikipedia

Diyi Yang, Aaron Halfaker, Robert Kraut and Eduard Hovy

In this work, we introduced a corpus for categorizing edit types in

Wikipedia. This fine-grained taxonomy of edit types enables us

to differentiate editing actions and find editor roles in Wikipedia

based on their low-level edit types. To do this, we first created an

annotated corpus based on 1,996 edits obtained from 953 article

revisions and built machine-learning models to automatically

identify the edit categories associated with edits. Building on

this automated measurement of edit types, we then applied a

graphical model analogous to Latent Dirichlet Allocation to

uncover the latent roles in editors’ edit histories. Applying this

technique revealed eight different roles editors play, such as Social

Networker, Substantive Expert, etc.

O15 - Morphology (1), Wednesday, May 25, 18:10

Chairperson: Tamás Váradi (Oral Session)

Morphologically Annotated Corpora and Morphological Analyzers for Moroccan and Sanaani Yemeni Arabic

Faisal Al shargi, Aidan Kaplan, Ramy Eskander, Nizar Habash and Owen Rambow

We present new language resources for Moroccan and Sanaani

Yemeni Arabic. The resources include corpora for each dialect

which have been morphologically annotated, and morphological

analyzers for each dialect which are derived from these corpora.

These are the first sets of resources for Moroccan and Yemeni

Arabic. The resources will be made available to the public.

Merging Data Resources for Inflectional and Derivational Morphology in Czech

Zdenek Žabokrtský, Magda Sevcikova, Milan Straka, Jonáš Vidra and Adéla Limburská

The paper deals with merging two complementary resources of

morphological data previously existing for Czech, namely the

inflectional dictionary MorfFlex CZ and the recently developed

lexical network DeriNet. The MorfFlex CZ dictionary has been

used by a morphological analyzer capable of analyzing/generating

several million Czech word forms according to the rules of

Czech inflection. The DeriNet network contains several hundred

thousand Czech lemmas interconnected with links corresponding

to derivational relations (relations between base words and words

derived from them). After summarizing basic characteristics of

both resources, the process of merging is described, focusing on

both rather technical aspects (growth of the data, measuring the

quality of newly added derivational relations) and linguistic issues

(treating lexical homonymy and vowel/consonant alternations).

The resulting resource contains 970 thousand lemmas connected

with 715 thousand derivational relations and is publicly available

on the web under the CC-BY-NC-SA license. The data

were incorporated in the MorphoDiTa library version 2.0

(which provides morphological analysis, generation, tagging and

lemmatization for Czech) and can be browsed and searched by

two web tools (DeriNet Viewer and DeriNet Search tool).

A New Integrated Open-source Morphological Analyzer for Hungarian

Attila Novák, Borbála Siklósi and Csaba Oravecz

The goal of a Hungarian research project has been to create

an integrated Hungarian natural language processing framework.

This infrastructure includes tools for analyzing Hungarian texts,

integrated into a standardized environment. The morphological

analyzer is one of the core components of the framework. The goal

of this paper is to describe a fast and customizable morphological

analyzer and its development framework, which synthesizes and

further enriches the morphological knowledge implemented in

previous tools existing for Hungarian. In addition, we present

the method we applied to add semantic knowledge to the lexical

database of the morphology. The method utilizes neural word

embedding models and morphological and shallow syntactic

knowledge.


O16 - Phonetics and Prosody, Wednesday, May 25, 18:10

Chairperson: Dafydd Gibbon (Oral Session)

New release of Mixer-6: Improved validity for phonetic study of speaker variation and identification

Eleanor Chodroff, Matthew Maciejewski, Jan Trmal, Sanjeev Khudanpur and John Godfrey

The Mixer series of speech corpora were collected over several

years, principally to support annual NIST evaluations of speaker

recognition (SR) technologies. These evaluations focused on

conversational speech over a variety of channels and recording

conditions. One of the series, Mixer-6, added a new condition,

read speech, to support basic scientific research on speaker

characteristics, as well as technology evaluation. With read speech

it is possible to make relatively precise measurements of phonetic

events and features, which can be correlated with the performance

of speaker recognition algorithms, or directly used in phonetic

analysis of speaker variability. The read speech, as originally

recorded, was adequate for large-scale evaluations (e.g., fixed-text

speaker ID algorithms) but only marginally suitable for acoustic-

phonetic studies. Numerous errors due largely to speaker behavior

remained in the corpus, with no record of their locations or rate

of occurrence. We undertook the effort to correct this situation

with automatic methods supplemented by human listening and

annotation. The present paper describes the tools and methods,

resulting corrections, and some examples of the kinds of research

studies enabled by these enhancements.

Assessing the Prosody of Non-Native Speakers of English: Measures and Feature Sets

Eduardo Coutinho, Florian Hönig, Yue Zhang, Simone Hantke, Anton Batliner, Elmar Nöth and Björn Schuller

In this paper, we describe a new database with audio recordings of

non-native (L2) speakers of English, and the perceptual evaluation

experiment conducted with native English speakers for assessing

the prosody of each recording. These annotations are then used

to compute the gold standard using different methods, and a

series of regression experiments is conducted to evaluate their

impact on the performance of a regression model predicting

the degree of naturalness of L2 speech. Further, we compare

the relevance of different feature groups modelling prosody in

general (without speech tempo), speech rate and pauses modelling

speech tempo (fluency), voice quality, and a variety of spectral

features. We also discuss the impact of various fusion strategies

on performance. Overall, our results demonstrate that the prosody

of non-native speakers of English as L2 can be reliably assessed

using supra-segmental audio features; prosodic features seem to

be the most important ones.

The IFCASL Corpus of French and German Non-native and Native Read Speech

Juergen Trouvain, Anne Bonneau, Vincent Colotte, Camille Fauth, Dominique Fohr, Denis Jouvet, Jeanin Jügler, Yves Laprie, Odile Mella, Bernd Möbius and Frank Zimmerer

The IFCASL corpus is a French-German bilingual phonetic

learner corpus designed, recorded and annotated in a project on

individualized feedback in computer-assisted spoken language

learning. The motivation for setting up this corpus was that

there is no phonetically annotated and segmented corpus for this

language pair of comparable size and coverage. In contrast to most learner corpora, the IFCASL corpus incorporates data for a

language pair in both directions, i.e. in our case French learners

of German, and German learners of French. In addition, the

corpus is complemented by two sub-corpora of native speech by

the same speakers. The corpus provides spoken data by about 100

speakers with comparable productions, annotated and segmented

on the word and the phone level, with more than 50% manually

corrected data. The paper reports on inter-annotator agreement

and the optimization of the acoustic models for forced speech-

text alignment in exercises for computer-assisted pronunciation

training. Example studies based on the corpus data with a phonetic

focus include topics such as the realization of /h/ and glottal stop,

final devoicing of obstruents, vowel quantity and quality, pitch

range, and tempo.

P14 - Lexical Databases, Wednesday, May 25, 18:10 - 19:10

Chairperson: Amália Mendes (Poster Session)

LELIO: An Auto-Adaptative System to Acquire Domain Lexical Knowledge in Technical Texts

Patrick Saint-Dizier

In this paper, we investigate some language acquisition facets of

an auto-adaptative system that can automatically acquire most

of the relevant lexical knowledge and authoring practices for

an application in a given domain. This is the LELIO project:

producing customized LELIE solutions. Our goal, within the

framework of LELIE (a system that tags language uses that do

not follow the Constrained Natural Language principles), is to

automate the long, costly and error prone lexical customization

of LELIE to a given application domain. Technical texts

being relatively restricted in terms of syntax and lexicon, results


obtained show that this approach is feasible and relatively reliable.

By auto-adaptative, we mean that the system learns from a sample

of the application corpus the various lexical terms and uses crucial

for LELIE to work properly (e.g. verb uses, fuzzy terms, business

terms, stylistic patterns). A technical writer validation method is

developed at each step of the acquisition.

Wikification for Scriptio Continua

Yugo Murawaki and Shinsuke Mori

The fact that Japanese employs scriptio continua, or a writing

system without spaces, complicates the first step of an NLP

pipeline. Word segmentation is widely used in Japanese

language processing, and lexical knowledge is crucial for reliable

identification of words in text. Although external lexical resources

like Wikipedia are potentially useful, segmentation mismatch

prevents them from being straightforwardly incorporated into the

word segmentation task. If we intentionally violate segmentation

standards with the direct incorporation, quantitative evaluation

will be no longer feasible. To address this problem, we propose

to define a separate task that directly links given texts to an

external resource, that is, wikification in the case of Wikipedia. By

doing so, we can circumvent segmentation mismatch that may not

necessarily be important for downstream applications. As the first

step to realize the idea, we design the task of Japanese wikification

and construct wikification corpora. We annotated subsets of

the Balanced Corpus of Contemporary Written Japanese plus

Twitter short messages. We also implement a simple wikifier and

investigate its performance on these corpora.

Accessing and Elaborating Walenty – a Valence Dictionary of Polish – via Internet Browser

Bartłomiej Niton, Tomasz Bartosiak and Elzbieta Hajnicz

This article presents Walenty - a new valence dictionary of Polish

predicates, concentrating on its creation process and access via

Internet browser. The dictionary contains two layers, syntactic

and semantic. The syntactic layer describes syntactic and

morphosyntactic constraints predicates put on their dependants.

The semantic layer shows how predicates and their arguments

are involved in a situation described in an utterance. These two

layers are connected, representing how semantic arguments can

be realised on the surface. Walenty also contains a powerful

phraseological (idiomatic) component. Walenty has been created

and can be accessed remotely with a dedicated tool called Slowal.

In this article, we focus on most important functionalities of this

system. First, we will depict how to access the dictionary and

how the built-in filtering system (covering both syntactic and semantic

phenomena) works. Later, we will describe the process of creating the dictionary with the Slowal tool, which both supports and controls the work

of lexicographers.

CEPLEXicon – A Lexicon of Child European Portuguese

Ana Lúcia Santos, Maria João Freitas and Aida Cardoso

CEPLEXicon (version 1.1) is a child lexicon resulting from

the automatic tagging of two child corpora: the corpus Santos

(Santos, 2006; Santos et al. 2014) and the corpus Child

– Adult Interaction (Freitas et al. 2012), which integrates

information from the corpus Freitas (Freitas, 1997). This

lexicon includes spontaneous speech produced by seven children

(1;02.00 to 3;11.12) during approximately 86h of child-adult

interaction. The automatic tagging comprised the lemmatization

and morphosyntactic classification of the speech produced by

the seven children included in the two child corpora; the

lexicon contains information pertaining to lemmas and syntactic

categories as well as absolute number of occurrences and

frequencies in three age intervals: < 2 years; ≥ 2 years and < 3 years; ≥ 3 years. The information included in this lexicon and the format in

which it is presented enables research in different areas and allows

researchers to obtain measures of lexical growth. CEPLEXicon is

available through the ELRA catalogue.

Extracting Weighted Language Lexicons from Wikipedia

Gregory Grefenstette

Language models are used in applications as diverse as

speech recognition, optical character recognition and information

retrieval. They are used to predict word appearance, and to

weight the importance of words in these applications. One basic

element of language models is the list of words in a language.

Another is the unigram frequency of each word. But this basic

information is not available for most languages in the world. Since

the multilingual Wikipedia project encourages the production of

encyclopedic-like articles in many world languages, we can find

there an ever-growing source of text from which to extract these

two language modelling elements: word list and frequency. Here

we present a simple technique for converting this Wikipedia

text into lexicons of weighted unigrams for the more than 280

languages currently present in Wikipedia. The lexicons

produced, and the source code for producing them in a Linux-

based system, are made freely available on the Web.
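
The basic counting step, turning plain article text into a weighted unigram lexicon, can be sketched in a few lines; the tokenization rule and the toy sentences below are illustrative stand-ins for the paper's Wikipedia-dump pipeline and Linux tooling.

    import re
    from collections import Counter

    def weighted_unigram_lexicon(articles):
        """Count unigram frequencies over plain-text articles."""
        counts = Counter()
        for text in articles:
            counts.update(re.findall(r"\w+", text.lower(), flags=re.UNICODE))
        return counts

    lexicon = weighted_unigram_lexicon(["An encyclopedia gathers articles.",
                                        "Articles describe the world."])
    print(lexicon.most_common(3))  # e.g. [('articles', 2), ('an', 1), ...]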

Wiktionnaire’s Wikicode GLAWIfied: a Workable French Machine-Readable Dictionary

Nabil Hathout and Franck Sajous

GLAWI is a free, large-scale and versatile Machine-Readable

Dictionary (MRD) that has been extracted from the French

language edition of Wiktionary, called Wiktionnaire. In

(Sajous and Hathout, 2015), we introduced GLAWI, gave the


rationale behind the creation of this lexicographic resource and

described the extraction process, focusing on the conversion

and standardization of the heterogeneous data provided by this

collaborative dictionary. In the current article, we describe the

content of GLAWI and illustrate how it is structured. We also

suggest various applications, ranging from linguistic studies, NLP

applications to psycholinguistic experimentation. They all can

take advantage of the diversity of the lexical knowledge available

in GLAWI. Besides this diversity and extensive lexical coverage,

GLAWI is also remarkable because it is the only free lexical

resource of contemporary French that contains definitions. This

unique material opens the way to the renewal of MRD-based methods,

notably the automated extraction and acquisition of semantic

relations.

P15 - Multimodality, Wednesday, May 25, 18:10 - 19:10

Chairperson: Carlo Strapparava (Poster Session)

A Corpus of Images and Text in Online News

Laura Hollink, Adriatik Bedjeti, Martin van Harmelen and Desmond Elliott

In recent years, several datasets have been released that include

images and text, giving impulse to new methods that combine

natural language processing and computer vision. However, there

is a need for datasets of images in their natural textual context.

The ION corpus contains 300K news articles published between

August 2014 and August 2015 in five online newspapers from two countries.

The 1-year coverage over multiple publishers ensures a broad

scope in terms of topics, image quality and editorial viewpoints.

The corpus consists of JSON-LD files with the following data

about each article: the original URL of the article on the news

publisher’s website, the date of publication, the headline of the

article, the URL of the image displayed with the article (if

any), and the caption of that image. Neither the article text

nor the images themselves are included in the corpus. Instead,

the images are distributed as high-dimensional feature vectors

extracted from a Convolutional Neural Network, anticipating their

use in computer vision tasks. The article text is represented as a

list of automatically generated entity and topic annotations in the

form of Wikipedia/DBpedia pages. This facilitates the selection

of subsets of the corpus for separate analysis or evaluation.
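
As a rough illustration of how such a corpus can be consumed, the hypothetical Python sketch below loads one JSON-LD record and pulls out the fields listed above; the key names used here are assumptions, since the released files may name them differently.

    # Hypothetical sketch of loading one ION-style JSON-LD record; the actual
    # key names in the released corpus may differ from the ones assumed here.
    import json

    def load_article(path):
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        return {
            "url": record.get("url"),                # original article URL
            "date": record.get("datePublished"),     # publication date
            "headline": record.get("headline"),      # article headline
            "image_url": record.get("image"),        # URL of the displayed image, if any
            "caption": record.get("caption"),        # caption of that image
            "entities": record.get("mentions", []),  # Wikipedia/DBpedia annotations
        }

    article = load_article("ion_article_0001.jsonld")
    print(article["headline"], len(article["entities"]))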

BosphorusSign: A Turkish Sign Language Recognition Corpus in Health and Finance Domains

Necati Cihan Camgöz, Ahmet Alp Kındıroglu, Serpil Karabüklü, Meltem Kelepir, Ayse Sumru Özsoy and Lale Akarun

There are as many sign languages as there are deaf communities

in the world. Linguists have been collecting corpora of different

sign languages and annotating them extensively in order to study

and understand their properties. On the other hand, the field of

computer vision has approached the sign language recognition

problem as a grand challenge and research efforts have intensified

in the last 20 years. However, corpora collected for studying

linguistic properties are often not suitable for sign language

recognition as the statistical methods used in the field require large

amounts of data. Recently, with the availability of inexpensive

depth cameras, groups from the computer vision community have

started collecting corpora with a large number of repetitions for

sign language recognition research. In this paper, we present the

BosphorusSign Turkish Sign Language corpus, which consists of

855 sign and phrase samples from the health, finance and everyday

life domains. The corpus is collected using the state-of-the-art

Microsoft Kinect v2 depth sensor, and will be the first of its kind in this field of sign language research. Furthermore, there will be annotations

rendered by linguists so that the corpus will appeal both to the

linguistic and sign language recognition research communities.

The CIRDO Corpus: Comprehensive Audio/Video Database of Domestic Falls of Elderly People

Michel Vacher, Saïda Bouakaz, Marc-Eric Bobillier Chaumon, Frédéric Aman, R. A. Khan, Slima Bekkadja, François Portet, Erwan Guillou, Solange Rossato and Benjamin Lecouteux

Ambient Assisted Living aims at enhancing the quality of life of

older and disabled people at home thanks to Smart Homes. In

particular, for elderly people living alone at home, the detection of distress situations after a fall is very important to reassure this population. However, many studies do not include tests in real

settings, because data collection in this domain is very expensive

and challenging and because of the few available data sets. The

CIRDO corpus is a dataset recorded in realistic conditions in DOMUS, a fully equipped Smart Home with microphones

and home automation sensors, in which participants performed

scenarios including real falls on a carpet and calls for help. These

scenarios were elaborated thanks to a field study involving elderly

persons. Experiments related in a first part to distress detection in

real-time using audio and speech analysis and in a second part to

fall detection using video analysis are presented. Results show the

difficulty of the task. The database can be used as a standardized database by researchers to evaluate and compare their systems for the assistance of elderly persons.

Semi-automatically Alignment of Predicates between Speech and OntoNotes data

Niraj Shrestha and Marie-Francine Moens

Speech data currently receives growing attention and is an important source of information. Unlike for written data, we still lack suitable corpora of transcribed speech annotated with semantic roles that can be used for semantic role labeling (SRL). Semantic role labeling in speech data is a challenging and

complex task due to the lack of sentence boundaries and the many

transcription errors such as insertion, deletion and misspellings

of words. In written data, SRL evaluation is performed at

the sentence level, but in speech data sentence boundary identification is still a bottleneck, which makes evaluation more

complex. In this work, we semi-automatically align the predicates

found in transcribed speech obtained with an automatic speech

recognizer (ASR) with the predicates found in the corresponding

written documents of the OntoNotes corpus and manually align

the semantic roles of these predicates thus obtaining annotated

semantic frames in the speech data. This data can serve as gold

standard alignments for future research in semantic role labeling

of speech data.

CORILSE: a Spanish Sign Language Repository for Linguistic Analysis

María del Carmen Cabeza-Pereiro, José M. García-Miguel, Carmen García Mateo and José Luis Alba Castro

CORILSE is a computerized corpus of Spanish Sign Language

(Lengua de Signos Española, LSE). It consists of a set of

recordings from different discourse genres by Galician signers

living in the city of Vigo. In this paper we describe its annotation

system, developed on the basis of pre-existing ones (mostly the

model of Auslan corpus). This includes primary annotation of id-

glosses for manual signs, annotation of the non-manual component, and secondary annotation of grammatical categories and relations, because this corpus is being built for grammatical analysis, in

particular argument structures in LSE. Up to now, the annotation has mostly been done by hand, which is a slow and

time-consuming task. The need to facilitate this process leads us

to engage in the development of automatic or semi-automatic tools

for manual and facial recognition. Finally, we also present the web

repository that will make the corpus available to different types of

users, and will allow its exploitation for research purposes and

other applications (e.g. teaching of LSE or design of tasks for

signed language assessment).

The OFAI Multi-Modal Task Description Corpus

Stephanie Schreitter and Brigitte Krenn

The OFAI Multimodal Task Description Corpus (OFAI-MMTD

Corpus) is a collection of dyadic teacher-learner (human-human

and human-robot) interactions. The corpus is multimodal

and tracks the communication signals exchanged between

interlocutors in task-oriented scenarios including speech, gaze and

gestures. The focus of interest lies on the communicative signals

conveyed by the teacher and which objects are salient at which

time. Data are collected from four different task description setups

which involve spatial utterances, navigation instructions and more

complex descriptions of joint tasks.

A Japanese Chess Commentary Corpus

Shinsuke Mori, John Richardson, Atsushi Ushiku, Tetsuro Sasada, Hirotaka Kameko and Yoshimasa Tsuruoka

In recent years there has been a surge of interest in the natural

language processing related to the real world, such as symbol

grounding, language generation, and nonlinguistic data search by

natural language queries. In order to concentrate on language

ambiguities, we propose to use a well-defined “real world”, that

is, game states. We built a corpus consisting of pairs of sentences and game states. The game we focus on is shogi (Japanese

chess). We collected 742,286 commentary sentences in Japanese.

They are spontaneously generated, in contrast to the natural language annotations in many image datasets, which are provided by human workers on Amazon Mechanical Turk. We defined domain-specific named

entities and we segmented 2,508 sentences into words manually

and annotated each word with a named entity tag. We describe a

detailed definition of named entities and show some statistics of

our game commentary corpus. We also show the results of the

experiments of word segmentation and named entity recognition.

The accuracies are as high as those on general domain texts

indicating that we are ready to tackle various new problems related

to the real world.

The CAMOMILE Collaborative Annotation Platform for Multi-modal, Multi-lingual and Multi-media Documents

Johann Poignant, Mateusz Budnik, Hervé Bredin, Claude Barras, Mickael Stefas, Pierrick Bruneau, Gilles Adda, Laurent Besacier, Hazim Ekenel, Gil Francopoulo, Javier Hernando, Joseph Mariani, Ramon Morros, Georges Quénot, Sophie Rosset and Thomas Tamisier

In this paper, we describe the organization and the implementation

of the CAMOMILE collaborative annotation framework for

multimodal, multimedia, multilingual (3M) data. Given the

versatile nature of the analysis which can be performed on 3M

data, the structure of the server was kept intentionally simple

in order to preserve its genericity, relying on standard Web

technologies. Layers of annotations, defined as data associated with a media fragment from the corpus, are stored in a database and

can be managed through standard interfaces with authentication.

Interfaces tailored specifically to the needed task can then be

developed in an agile way, relying on simple but reliable services

for the management of the centralized annotations. We then

present our implementation of an active learning scenario for

person annotation in video, relying on the CAMOMILE server;

during a dry run experiment, the manual annotation of 716 speech

segments was thus propagated to 3504 labeled tracks. The code of the CAMOMILE framework is distributed as open source.
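
To make the idea of layered, fragment-anchored annotations more concrete, here is a purely illustrative Python sketch of posting one annotation to such a server over HTTP; the server URL, endpoint paths, payload fields and credentials are invented for the example and are not the actual CAMOMILE API.

    # Illustrative sketch only: the endpoint paths, payload fields and
    # authentication scheme below are assumptions, not the actual CAMOMILE API.
    import requests

    SERVER = "https://camomile.example.org"

    def post_annotation(session, layer_id, medium_id, start, end, label):
        payload = {
            "fragment": {"medium": medium_id, "start": start, "end": end},
            "data": {"label": label},
        }
        resp = session.post(f"{SERVER}/layer/{layer_id}/annotation", json=payload)
        resp.raise_for_status()
        return resp.json()

    with requests.Session() as session:
        session.auth = ("annotator", "secret")  # placeholder credentials
        post_annotation(session, "speaker-layer", "episode-01", 12.4, 15.9, "John Doe")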

Finding Recurrent Features of Image Schema Gestures: the FIGURE corpus

Andy Luecking, Alexander Mehler, Désirée Walther, Marcel Mauri and Dennis Kurfürst

The Frankfurt Image GestURE corpus (FIGURE) is introduced.

The corpus data is collected in an experimental setting where 50

naive participants spontaneously produced gestures in response

to five to six terms from a total of 27 stimulus terms. The

stimulus terms have been compiled mainly from image schemata

from psycholinguistics, since such schemata provide a panoply

of abstract contents derived from natural language use. The

gestures have been annotated for kinetic features. FIGURE

aims at finding (sets of) stable kinetic feature configurations

associated with the stimulus terms. Given such configurations,

they can be used for designing HCI gestures that go beyond pre-

defined gesture vocabularies or touchpad gestures. It is found,

for instance, that movement trajectories are far more informative

than handshapes, speaking against purely handshape-based HCI

vocabularies. Furthermore, the mean temporal duration of the associated hand and arm movements varies with the stimulus terms, indicating a dynamic dimension not covered by vocabulary-based

approaches. Descriptive results are presented and related to

findings from gesture studies and natural language dialogue.

An Interaction-Centric Dataset for Learning Automation Rules in Smart Homes

Kai Frederic Engelmann, Patrick Holthaus, Britta Wrede and Sebastian Wrede

The term smart home refers to a living environment that by

its connected sensors and actuators is capable of providing

intelligent and contextualised support to its user. This may

result in automated behaviors that blend into the user’s daily

life. However, currently most smart homes do not provide

such intelligent support. A first step towards such intelligent

capabilities lies in learning automation rules by observing the

user’s behavior. We present a new type of corpus for learning

such rules from user behavior as observed from the events in

a smart home’s sensor and actuator network. The data contains

information about intended tasks by the users and synchronized

events from this network. It is derived from interactions of 59

users with the smart home in order to solve five tasks. The corpus

contains recordings of more than 40 different types of data streams

and has been segmented and pre-processed to increase signal

quality. Overall, the data shows a high noise level on specific data

types that can be filtered out by a simple smoothing approach. The

resulting data provides insights into event patterns resulting from

task specific user behavior and thus constitutes a basis for machine

learning approaches to learn automation rules.
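
As an informal illustration of the kind of simple smoothing mentioned above (the exact approach used for the corpus is not specified here), the following Python sketch applies a moving average to a noisy numeric sensor stream; the window size and sample values are assumptions.

    # A minimal sketch of simple smoothing over a numeric sensor stream,
    # here a moving average; window size and sample readings are invented.
    from collections import deque

    def moving_average(stream, window=5):
        buf = deque(maxlen=window)
        for value in stream:
            buf.append(value)
            # Emit the mean of the last `window` readings as the smoothed value.
            yield sum(buf) / len(buf)

    noisy = [0.0, 0.1, 5.0, 0.2, 0.1, 0.0, 4.8, 0.1]
    print(list(moving_average(noisy, window=3)))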

A Web Tool for Building Parallel Corpora of Spoken and Sign Languages

Alex Becker, Fabio Kepler and Sara Candeias

In this paper we describe our work in building an online tool

for manually annotating texts in any spoken language with

SignWriting in any sign language. The existence of such tool will

allow the creation of parallel corpora between spoken and sign

languages that can be used to bootstrap the creation of efficient

tools for the Deaf community. As an example, a parallel corpus

between English and American Sign Language could be used

for training Machine Learning models for automatic translation

between the two languages. Clearly, this kind of tool must be

designed in a way that it eases the task of human annotators, not

only by being easy to use, but also by giving smart suggestions

as the annotation progresses, in order to save time and effort.

By building a collaborative, online, easy to use annotation tool

for building parallel corpora between spoken and sign languages

we aim at helping the development of proper resources for

sign languages that can then be used in state-of-the-art models

currently used in tools for spoken languages. There are several

issues and difficulties in creating this kind of resource, and our

presented tool already deals with some of them, like adequate text

representation of a sign and many-to-many alignments between

words and signs.

P16 - Ontologies

Wednesday, May 25, 18:10 - 19:10

Chairperson: Elena Montiel Ponsoda Poster Session

Issues and Challenges in Annotating Urdu Action Verbs on the IMAGACT4ALL Platform

Sharmin Muzaffar, Pitambar Behera and Girish Jha

In South-Asian languages such as Hindi and Urdu, action verbs

having compound constructions and serial verbs constructions

pose serious problems for natural language processing and

other linguistic tasks. Urdu is an Indo-Aryan language

spoken by 51,500,000 speakers in India. Action verbs

that occur spontaneously in day-to-day communication are

semantically highly ambiguous and, as a consequence,

cause disambiguation issues that are relevant and applicable

to Language Technologies (LT) like Machine Translation (MT)

and Natural Language Processing (NLP). IMAGACT4ALL is an

ontology-driven web-based platform developed by the University

of Florence for storing action verbs and their inter-relations.

This group is currently collaborating with Jawaharlal Nehru

University (JNU) in India to connect Indian languages on this

platform. Action verbs are frequently used in both written and

spoken discourses and refer to various meanings because of their

polysemic nature. The IMAGACT4ALL platform stores 3D animation images, each of them referring to a variety of possible ontological types, which in turn makes the annotation task quite challenging for the annotator with regard to selecting a verb argument structure from a range of probable alternatives.

The authors, in this paper, discuss the issues and challenges

such as complex predicates (compound and conjunct verbs),

ambiguously animated video illustrations, semantic discrepancies,

and the factors of verb-selection preferences that have produced

significant problems in annotating Urdu verbs on the IMAGACT

ontology.

Domain Ontology Learning Enhanced by Optimized Relation Instance in DBpedia

Liumingjing Xiao, Chong Ruan, An Yang, Junhao Zhang and Junfeng Hu

Ontologies are powerful tools for supporting semantics-based applications and intelligent systems. However, ontology learning is challenging due to the bottleneck of handcrafting structured knowledge sources and training data. To address this difficulty, many researchers turn

to ontology enrichment and population using external knowledge

sources such as DBpedia. In this paper, we propose a method

using DBpedia in a different manner. We utilize relation instances

in DBpedia to supervise the ontology learning procedure from

unstructured text, rather than populate the ontology structure as

a post-processing step. We construct three language resources

in areas of computer science: enriched Wikipedia concept tree,

domain ontology, and gold standard from NSFC taxonomy.

Experiments show that the result of ontology learning from a corpus of computer science can be improved via the relation instances

extracted from DBpedia in the same field. Furthermore, making a distinction between the relation instances and applying a proper weighting scheme in the learning procedure leads to even better results.

Constructing a Norwegian Academic Wordlist

Janne M Johannessen, Arash Saidi and Kristin Hagen

We present the development of a Norwegian Academic Wordlist

(AKA list) for the Norwegian Bokmål variety. To identify

specific academic vocabulary we developed a 100-million-word

academic corpus based on the University of Oslo archive of digital

publications. Other corpora were used for testing and developing

general word lists. We tried two different methods, those of

Carlund et al. (2012) and Gardner & Davies (2013), and compared

them. The resulting list is presented on a web site, where the

words can be inspected in different ways, and freely downloaded.

The Event and Implied Situation Ontology (ESO): Application and Evaluation

Roxane Segers, Marco Rospocher, Piek Vossen, Egoitz Laparra, German Rigau and Anne-Lyse Minard

This paper presents the Event and Implied Situation Ontology

(ESO), a manually constructed resource which formalizes the

pre- and post-situations of events and the roles of the entities

affected by an event. The ontology is built on top of existing

resources such as WordNet, SUMO and FrameNet. The ontology

is injected into the Predicate Matrix, a resource that integrates

predicate and role information from amongst others FrameNet,

VerbNet, PropBank, NomBank and WordNet. We illustrate how

these resources are used on large document collections to detect

information that otherwise would have remained implicit. The

ontology is evaluated on two aspects: first, recall and precision based on a manually annotated corpus and, secondly, on the quality of

the knowledge inferred by the situation assertions in the ontology.

Evaluation results on the quality of the system show that 50% of

the events typed and enriched with ESO assertions are correct.

Combining Ontologies and Neural Networks for Analyzing Historical Language Varieties. A Case Study in Middle Low German

Maria Sukhareva and Christian Chiarcos

In this paper, we describe experiments on the morphosyntactic

annotation of historical language varieties for the example of

Middle Low German (MLG), the official language of the German

Hanse during the Middle Ages and a dominant language around the Baltic Sea at the time. To the best of our knowledge, this is

the first experiment in automatically producing morphosyntactic

annotations for Middle Low German, and accordingly, no part-of-

speech (POS) tagset is currently agreed upon. In our experiment,

we illustrate how ontology-based specifications of projected

annotations can be employed to circumvent this issue: Instead

of training and evaluating against a given tagset, we decompose

it into independent features which are predicted independently

by a neural network. Using consistency constraints (axioms)

from an ontology, then, the predicted feature probabilities are

decoded into a sound ontological representation. Using these

representations, we can finally bootstrap a POS tagset capturing

only morphosyntactic features which could be reliably predicted.

In this way, our approach is capable of optimizing precision

and recall of morphosyntactic annotations simultaneously with

bootstrapping a tagset rather than performing iterative cycles.
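
The interplay of independent feature prediction and ontology-based consistency constraints can be illustrated with the hypothetical Python sketch below; the feature names, probabilities and axioms are invented stand-ins, not the actual MLG tagset or ontology.

    # Hedged sketch of constraint-based decoding: per-feature probabilities
    # (as a neural network might output them) are combined into the most
    # probable feature bundle that satisfies hand-written consistency axioms.
    # Feature names, values and axioms are illustrative only.
    from itertools import product

    probs = {
        "pos":   {"NOUN": 0.6, "VERB": 0.4},
        "case":  {"nom": 0.5, "acc": 0.3, "none": 0.2},
        "tense": {"pres": 0.55, "past": 0.25, "none": 0.2},
    }

    def consistent(bundle):
        # Example axioms: nouns carry case but no tense; verbs the opposite.
        if bundle["pos"] == "NOUN":
            return bundle["case"] != "none" and bundle["tense"] == "none"
        if bundle["pos"] == "VERB":
            return bundle["case"] == "none" and bundle["tense"] != "none"
        return True

    def decode(probs):
        best, best_score = None, 0.0
        for values in product(*probs.values()):
            bundle = dict(zip(probs.keys(), values))
            score = 1.0
            for feat, val in bundle.items():
                score *= probs[feat][val]
            if consistent(bundle) and score > best_score:
                best, best_score = bundle, score
        return best, best_score

    print(decode(probs))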

Ecological Gestures for HRI: the GEE Corpus

Maxence Girard-Rivier, Romain Magnani, Veronique Auberge, Yuko Sasa, Liliya Tsvetanova, Frederic Aman and Clarisse Bayol

As part of a human-robot interaction project, we are interested in the gestural modality as one of many ways to communicate. In order to develop a relevant gesture recognition system associated with a smart home butler robot, our methodology is based on an IQ game-like Wizard of Oz experiment to collect spontaneous and

implicitly produced gestures in an ecological context. During the

experiment, the subject has to use non-verbal cues (i.e. gestures)

to interact with a robot that is the referee. The subject is unaware

that his gestures will be the focus of our study. In the second

part of the experiment, we asked the subject to reproduce the gestures he had produced in the first part; these are the explicit gestures.

The implicit gestures are compared with explicitly produced ones

to determine a relevant ontology. This preliminary qualitative analysis will be the basis for building a big data corpus in order to

optimize acceptance of the gesture dictionary in coherence with

the “socio-affective glue” dynamics.

A Taxonomy of Spanish Nouns, a Statistical Algorithm to Generate it and its Implementation in Open Source Code

Rogelio Nazar and Irene Renau

In this paper we describe our work in progress in the automatic

development of a taxonomy of Spanish nouns, we offer the Perl

implementation we have so far, and we discuss the different

problems that still need to be addressed. We designed a

statistically-based taxonomy induction algorithm consisting of a

combination of different strategies not involving explicit linguistic

knowledge. Being all quantitative, the strategies we present

are however of different nature. Some of them are based on

the computation of distributional similarity coefficients which

identify pairs of sibling words or co-hyponyms, while others are

based on asymmetric co-occurrence and identify pairs of parent-

child words or hypernym-hyponym relations. A decision making

process is then applied to combine the results of the previous steps,

and finally connect lexical units to a basic structure containing the

most general categories of the language. We evaluate the quality

of the taxonomy both manually and also using Spanish Wordnet as

a gold-standard. We estimate an average of 89.07% precision and

25.49% recall considering only the results which the algorithm

presents with high degree of certainty, or 77.86% precision and

33.72% recall considering all results.
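
Since the authors' implementation is in Perl, the following Python fragment is only an illustrative sketch of the two kinds of quantitative signals described above: cosine similarity between co-occurrence vectors for co-hyponym candidates, and an asymmetric co-occurrence ratio for hypernym-hyponym candidates; all vectors and counts are invented.

    # Illustrative Python sketch (the authors' implementation is in Perl):
    # cosine similarity over co-occurrence vectors suggests co-hyponym pairs,
    # while an asymmetric co-occurrence ratio suggests hypernym-hyponym pairs.
    import math

    def cosine(u, v):
        dot = sum(u.get(k, 0) * v.get(k, 0) for k in u)
        norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
        return dot / norm if norm else 0.0

    def asymmetry(cooc, a, b):
        # How much more often `a` appears in contexts of `b` than vice versa;
        # a strongly asymmetric ratio hints that `a` is the more general term.
        ab, ba = cooc.get((a, b), 0), cooc.get((b, a), 0)
        return (ab + 1) / (ba + 1)

    vec_perro = {"ladrar": 8, "correr": 5, "animal": 3}
    vec_gato  = {"maullar": 7, "correr": 4, "animal": 3}
    print("co-hyponym score:", cosine(vec_perro, vec_gato))

    cooc = {("animal", "perro"): 40, ("perro", "animal"): 12}
    print("hypernymy score animal->perro:", asymmetry(cooc, "animal", "perro"))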

P17 - Part of Speech Tagging (1)

Wednesday, May 25, 18:10 - 19:10

Chairperson: Krister Linden Poster Session

FOLK-Gold – A Gold Standard for Part-of-Speech-Tagging of Spoken German

Swantje Westpfahl and Thomas Schmidt

In this paper, we present a GOLD standard of part-of-

speech tagged transcripts of spoken German. The GOLD

standard data consists of four annotation layers – transcription

(modified orthography), normalization (standard orthography),

lemmatization and POS tags – all of which have undergone careful

manual quality control. It comes with guidelines for the manual

POS annotation of transcripts of German spoken data and an

extended version of the STTS (Stuttgart Tübingen Tagset) which

accounts for phenomena typically found in spontaneous spoken

German. The GOLD standard was developed on the basis of the

Research and Teaching Corpus of Spoken German, FOLK, and is,

to our knowledge, the first such dataset based on a wide variety of

spontaneous and authentic interaction types. It can be used as a

basis for further development of language technology and corpus

linguistic applications for German spoken language.

Fast and Robust POS tagger for Arabic Tweets Using Agreement-based Bootstrapping

Fahad Albogamy and Allan Ramsay

Part-of-Speech(POS) tagging is a key step in many NLP

algorithms. However, tweets are difficult to POS tag because

they are short, are not always written maintaining formal grammar

and proper spelling, and abbreviations are often used to overcome

their restricted lengths. Arabic tweets also show a further range

of linguistic phenomena such as usage of different dialects,

romanised Arabic and borrowing foreign words. In this paper,

we present an evaluation and a detailed error analysis of state-

of-the-art POS taggers for Arabic when applied to Arabic tweets.

On the basis of this analysis, we combine normalisation and

external knowledge to handle the domain noisiness and exploit

bootstrapping to construct extra training data in order to improve

POS tagging for Arabic tweets. Our results show significant

improvements over the performance of a number of well-known

taggers for Arabic.

Lemmatization and Morphological Tagging in German and Latin: A Comparison and a Survey of the State-of-the-art

Steffen Eger, Rüdiger Gleim and Alexander Mehler

This paper relates to the challenge of morphological tagging

and lemmatization in morphologically rich languages by example

of German and Latin. We focus on the question of what a

practitioner can expect when using state-of-the-art solutions out

of the box. Moreover, we contrast these with old(er) methods and

implementations for POS tagging. We examine to what degree

recent efforts in tagger development are reflected by improved

accuracies – and at what cost, in terms of training and processing

time. We also conduct in-domain vs. out-domain evaluation.

Out-domain evaluations are particularly insightful because the

distribution of the data which is being tagged by a user will

typically differ from the distribution on which the tagger has

been trained. Furthermore, two lemmatization techniques are

evaluated. Finally, we compare pipeline tagging vs. a tagging

approach that acknowledges dependencies between inflectional

categories.

TLT-CRF: A Lexicon-supported Morphological Tagger for Latin Based on Conditional Random Fields

Tim vor der Brück and Alexander Mehler

We present a morphological tagger for Latin, called TTLab Latin

Tagger based on Conditional Random Fields (TLT-CRF) which

uses a large Latin lexicon. Beyond Part of Speech (PoS), TLT-

CRF tags eight inflectional categories of verbs, adjectives or

nouns. It utilizes a statistical model based on CRFs together with

a rule interpreter that addresses scenarios of sparse training data.

We present results of evaluating TLT-CRF to answer the question

what can be learnt following the paradigm of 1st order CRFs in

conjunction with a large lexical resource and a rule interpreter.

Furthermore, we investigate the contingency of representational

features and targeted parts of speech to learn about selective

features.

Cross-lingual and Supervised Models for Morphosyntactic Annotation: a Comparison on Romanian

Lauriane Aufrant, Guillaume Wisniewski and François Yvon

Because of the small size of Romanian corpora, the performance

of a PoS tagger or a dependency parser trained with the standard

supervised methods falls far short of the performance achieved in most languages. That is why we apply state-of-the-art methods

for cross-lingual transfer on Romanian tagging and parsing,

from English and several Romance languages. We compare

the performance with monolingual systems trained with sets of

different sizes and establish that training on a few sentences in

target language yields better results than transferring from large

datasets in other languages.

Corpus vs. Lexicon Supervision in Morphosyntactic Tagging: the Case of Slovene

Nikola Ljubešic and Tomaž Erjavec

In this paper we present a tagger developed for inflectionally rich

languages for which both a training corpus and a lexicon are

available. We do not constrain the tagger by the lexicon entries,

allowing both for lexicon incompleteness and noisiness. By using

the lexicon indirectly through features we allow for known and

unknown words to be tagged in the same manner. We test our

tagger on Slovene data, obtaining a 25% error reduction of the

best previous results both on known and unknown words. Given

that Slovene is, in comparison to some other Slavic languages, a

well-resourced language, we perform experiments on the impact

of token (corpus) vs. type (lexicon) supervision, obtaining useful insights into how to balance the effort of extending resources to

yield better tagging results.

P18 - Treebanks (1)

Wednesday, May 25, 18:10 - 19:10

Chairperson: Béatrice Daille Poster Session

Challenges and Solutions for Consistent Annotation of Vietnamese Treebank

Quy Nguyen, Yusuke Miyao, Ha Le and Ngan Nguyen

Treebanks are important resources for researchers in natural

language processing, speech recognition, theoretical linguistics,

etc. To strengthen the automatic processing of the Vietnamese

language, a Vietnamese treebank has been built. However, the

quality of this treebank is not satisfactory and is a possible source of the low performance of Vietnamese language processing. We

have been building a new treebank for Vietnamese with about

40,000 sentences annotated with three layers: word segmentation,

part-of-speech tagging, and bracketing. In this paper, we describe

several challenges of the Vietnamese language and how we solve them

in developing annotation guidelines. We also present our methods

to improve the quality of the annotation guidelines and ensure

annotation accuracy and consistency. Experimental results show that inter-annotator agreement ratios and accuracy are higher than 90%, which is satisfactory.

Correcting Errors in a Treebank Based on Tree Mining

Kanta Suzuki, Yoshihide Kato and Shigeki Matsubara

This paper provides a new method to correct annotation errors

in a treebank. The previous error correction method constructs

a pseudo parallel corpus where incorrect partial parse trees are

paired with correct ones, and extracts error correction rules from

the parallel corpus. By applying these rules to a treebank, the

method corrects errors. However, this method does not achieve

wide coverage of error correction. To achieve wide coverage,

our method adopts a different approach. In our method, we

consider that an infrequent pattern which can be transformed to

a frequent one is an annotation error pattern. Based on a tree

mining technique, our method seeks such infrequent tree patterns,

and constructs error correction rules each of which consists of

an infrequent pattern and a corresponding frequent pattern. We

conducted an experiment using the Penn Treebank. We obtained

1,987 rules which are not constructed by the previous method, and

the rules achieved good precision.

4Couv: A New Treebank for French

Philippe Blache, Gregoire de Montcheuil, Laurent Prévot and Stéphane Rauzy

The question of the type of text used as primary data in treebanks

is of certain importance. First, it has an influence at the discourse

level: an article is not organized in the same way as a novel or a

technical document. Moreover, it also has consequences in terms

of semantic interpretation: some types of texts can be easier to

interpret than others. We present in this paper a new type of

treebank which has the particularity of answering the specific needs of experimental linguistics. It is made of short texts (book back covers) that present a strong coherence in their organization

and can be rapidly interpreted. This type of text is adapted to short

reading sessions, making it easy to acquire physiological data (e.g.

eye movement, electroencephalography). Such a resource offers

reliable data when looking for correlations between computational

models and human language processing.

CINTIL DependencyBank PREMIUM - A Corpus of Grammatical Dependencies for Portuguese

Rita de Carvalho, Andreia Querido, Marisa Campos, Rita Valadas Pereira, João Silva and António Branco

This paper presents a new linguistic resource for the study

and computational processing of Portuguese. CINTIL

DependencyBank PREMIUM is a corpus of Portuguese news text,

accurately manually annotated with a wide range of linguistic

information (morpho-syntax, named-entities, syntactic function

and semantic roles), making it an invaluable resource especially for

the development and evaluation of data-driven natural language

processing tools. The corpus is under active development,

reaching 4,000 sentences in its current version. The paper also

reports on the training and evaluation of a dependency parser

over this corpus. CINTIL DependencyBank PREMIUM is freely-

available for research purposes through META-SHARE.

Estonian Dependency Treebank: from Constraint Grammar tagset to Universal Dependencies

Kadri Muischnek, Kaili Müürisep and Tiina Puolakainen

This paper presents the first version of Estonian Universal

Dependencies Treebank which has been semi-automatically

acquired from Estonian Dependency Treebank and comprises ca

400,000 words (ca 30,000 sentences) representing the genres of

fiction, newspapers and scientific writing. The article analyses the differences between the two annotation schemes and the conversion

procedure to Universal Dependencies format. The conversion

has been conducted by manually created Constraint Grammar

transfer rules. As the rules make it possible to consider unbounded context and to include lexical information and both flat and tree structure features at the same time, the method has proved to be reliable and flexible enough to handle most of the transformations. The automatic

conversion procedure achieved LAS 95.2%, UAS 96.3% and LA

98.4%. If punctuation marks were excluded from the calculations,

we observed LAS 96.4%, UAS 97.7% and LA 98.2%. Still

the refinement of the guidelines and methodology is needed

in order to re-annotate some syntactic phenomena, e.g. inter-

clausal relations. Although automatic rules usually make quite

a good guess even in obscure conditions, some relations should be

checked and annotated manually after the main conversion.
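
For readers unfamiliar with the metrics quoted above, the following small Python sketch shows how LAS, UAS and LA can be computed from gold and predicted (head, label) pairs, including the option of excluding punctuation; the toy data is invented.

    # A minimal sketch of how LAS, UAS and LA are computed from gold and
    # predicted (head, label) pairs, optionally skipping punctuation tokens.
    def attachment_scores(gold, pred, skip_punct=False):
        las = uas = la = total = 0
        for (g_head, g_label), (p_head, p_label) in zip(gold, pred):
            if skip_punct and g_label == "punct":
                continue
            total += 1
            uas += g_head == p_head
            la += g_label == p_label
            las += g_head == p_head and g_label == p_label
        return {k: 100.0 * v / total for k, v in
                {"LAS": las, "UAS": uas, "LA": la}.items()}

    gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
    pred = [(2, "nsubj"), (0, "root"), (3, "obj"), (2, "punct")]
    print(attachment_scores(gold, pred))
    print(attachment_scores(gold, pred, skip_punct=True))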

The Universal Dependencies Treebank of Spoken Slovenian

Kaja Dobrovoljc and Joakim Nivre

This paper presents the construction of an open-source

dependency treebank of spoken Slovenian, the first syntactically

annotated collection of spontaneous speech in Slovenian. The

treebank has been manually annotated using the Universal

Dependencies annotation scheme, a one-layer syntactic annotation

scheme with a high degree of cross-modality, cross-framework

and cross-language interoperability. In this original application

of the scheme to spoken language transcripts, we address a

wide spectrum of syntactic particularities in speech, either by

extending the scope of application of existing universal labels or

by proposing new speech-specific extensions. The initial analysis

of the resulting treebank and its comparison with the written

Slovenian UD treebank confirms significant syntactic differences

between the two language modalities, with spoken data consisting

of shorter and more elliptic sentences, fewer and simpler nominal

phrases, and more relations marking disfluencies, interaction,

deixis and modality.

Introducing the Asian Language Treebank (ALT)

Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch and Eiichiro Sumita

This paper introduces the ALT project initiated by the Advanced

Speech Translation Research and Development Promotion Center

(ASTREC), NICT, Kyoto, Japan. The aim of this project is to

accelerate NLP research for Asian languages such as Indonesian,

Japanese, Khmer, Lao, Malay, Myanmar, Filipino, Thai and

Vietnamese. The original resource for this project was English

articles that were randomly selected from Wikinews. The project

has so far created a corpus for Myanmar and will extend in scope

to include other languages in the near future. A 20000-sentence

corpus of Myanmar that has been manually translated from an

English corpus has been word segmented, word aligned, part-of-

speech tagged and constituency parsed by human annotators. In

this paper, we present the implementation steps for creating the

treebank in detail, including a description of the ALT web-based

treebanking tool. Moreover, we report statistics on the annotation

quality of the Myanmar treebank created so far.

Universal Dependencies for Norwegian

Lilja Øvrelid and Petter Hohle

This article describes the conversion of the Norwegian

Dependency Treebank to the Universal Dependencies scheme.

This paper details the mapping of PoS tags, morphological

features and dependency relations and provides a description of

the structural changes made to NDT analyses in order to make

it compliant with the UD guidelines. We further present PoS

tagging and dependency parsing experiments which report first

results for the processing of the converted treebank. The full

converted treebank was made available with the 1.2 release of the

UD treebanks.

O17 - Language Resource Policies

Thursday, May 26, 9:45

Chairperson: Edouard Geoffrois Oral Session

Fostering the Next Generation of European Language Technology: Recent Developments – Emerging Initiatives – Challenges and Opportunities

Georg Rehm, Jan Hajic, Josef van Genabith and Andrejs Vasiljevs

META-NET is a European network of excellence, founded

in 2010, that consists of 60 research centres in 34 European

countries. One of the key visions and goals of META-NET is a

truly multilingual Europe, which is substantially supported and

realised through language technologies. In this article we provide

an overview of recent developments around the multilingual

Europe topic, we also describe recent and upcoming events as well

as recent and upcoming strategy papers. Furthermore, we provide

overviews of two new emerging initiatives, the CEF.AT and ELRC

activity on the one hand and the Cracking the Language Barrier

federation on the other. The paper closes with several suggested

next steps in order to address the current challenges and to open

up new opportunities.

Yes, We Care! Results of the Ethics and Natural Language Processing Surveys

Karën Fort and Alain Couillault

We present here the context and results of two surveys (a French

one and an international one) concerning Ethics and NLP, which

we designed and conducted between June and September 2015.

These surveys follow other actions related to raising concern

for ethics in our community, including a Journée d’études, a

workshop and the Ethics and Big Data Charter. The concern

for ethics shows to be quite similar in both surveys, despite a

few differences which we present and discuss. The surveys also lead us to think there is a growing awareness in the field concerning

ethical issues, which translates into a willingness to get involved

in ethics-related actions, to debate the topic and to see ethics included among the themes of major conferences. We finally discuss the

limits of the surveys and the means of action we consider for the

future. The raw data from the two surveys are freely available

online.

Open Data Vocabularies for Assigning Usage Rights to Data Resources from Translation Projects

David Lewis, Kaniz Fatema, Alfredo Maldonado, Brian Walshe and Arturo Calvo

An assessment of the intellectual property requirements for data

used in machine-aided translation is provided based on a recent

EC-funded legal review. This is compared against the capabilities

offered by current linked open data standards from the W3C

for publishing and sharing translation memories from translation

projects, and proposals for adequately addressing the intellectual

property needs of stakeholders in translation projects using open

data vocabularies are suggested.

Language Resource Citation: the ISLRN Dissemination and Further Developments

Valérie Mapelli, Vladimir Popescu, Lin Liu and Khalid Choukri

This article presents the latest dissemination activities and

technical developments that were carried out for the International

Standard Language Resource Number (ISLRN) service. It also

recalls the main principle and submission process for providers

to obtain their 13-digit ISLRN identifier. Up to March 2016,

2100 Language Resources were allocated an ISLRN number, not

only ELRA’s and LDC’s catalogued Language Resources, but

also the ones from other important organisations like the Joint

Research Centre (JRC) and the Resource Management Agency

(RMA), who expressed their strong support for this initiative. In

the research field, not only assigning a unique identification

number is important, but also referring to a Language Resource

as an object per se (like publications) has now become an

obvious requirement. The ISLRN could also become an important

parameter to be considered to compute a Language Resource

Impact Factor (LRIF) in order to recognize the merits of the

producers of Language Resources. Integrating the ISLRN number

into a LR-oriented bibliographical reference is thus part of the

objective. The idea is to make use of a BibTeX entry that

would take into account Language Resource items, including the ISLRN. The ISLRN being a requested field within the LREC 2016

submission, we expect that several other LRs will be allocated an

ISLRN number by the conference date. With this expansion, this number aims to become a widely used LR citation instrument in works referring to LRs.

Trends in HLT Research: A Survey of LDC’s Data Scholarship Program

Denise DiPersio and Christopher Cieri

Since its inception in 2010, the Linguistic Data Consortium’s data

scholarship program has awarded no cost grants in data to 64

recipients from 26 countries. A survey of the twelve cycles to

date – two awards each in the Fall and Spring semesters from

Fall 2010 through Spring 2016 – yields an interesting view into

graduate program research trends in human language technology

and related fields and the particular data sets deemed important to

support that research. The survey also reveals regions in which

such activity appears to be on a rise, including in Arabic-speaking

regions and portions of the Americas and Asia.

O18 - Tweet Corpora and Analysis

Thursday, May 26, 9:45

Chairperson: Bernardo Magnini Oral Session

Tweeting and Being Ironic in the Debate about a Political Reform: the French Annotated Corpus TWitter-MariagePourTous

Cristina Bosco, Mirko Lai, Viviana Patti and Daniela Virone

The paper introduces a new annotated French data set for

Sentiment Analysis, which is a currently missing resource. It

focuses on the collection from Twitter of data related to the socio-

political debate about the reform of the marriage law in France.

The design of the annotation scheme is described, which extends

a polarity label set by making available tags for marking target

semantic areas and figurative language devices. The annotation

process is presented and the disagreement discussed, in particular,

in the perspective of figurative language use and in that of the

semantic oriented annotation, which are open challenges for NLP

systems.

Towards a Corpus of Violence Acts in Arabic Social Media

Ayman Alhelbawy, Massimo Poesio and Udo Kruschwitz

In this paper we present a new corpus of Arabic tweets that

mention some form of violent event, developed to support the

automatic identification of Human Rights Abuse. The dataset

was manually labelled for seven classes of violence using

crowdsourcing.

TwiSty: A Multilingual Twitter Stylometry Corpus for Gender and Personality Profiling

Ben Verhoeven, Walter Daelemans and Barbara Plank

Personality profiling is the task of detecting personality traits of

authors based on writing style. Several personality typologies

exist; however, the Myers-Briggs Type Indicator (MBTI) is

particularly popular in the non-scientific community, and many

people use it to analyse their own personality and talk about the

results online. Therefore, large amounts of self-assessed data

on MBTI are readily available on social-media platforms such

as Twitter. We present a novel corpus of tweets annotated with

the MBTI personality type and gender of their author for six

Western European languages (Dutch, German, French, Italian,

Portuguese and Spanish). We outline the corpus creation and

annotation, show statistics of the obtained data distributions and

present first baselines on Myers-Briggs personality profiling and

gender prediction for all six languages.

Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages

Muhammad Imran, Prasenjit Mitra and Carlos Castillo

Microblogging platforms such as Twitter provide active

communication channels during mass convergence and

emergency events such as earthquakes and typhoons. During the

sudden onset of a crisis situation, affected people post useful

information on Twitter that can be used for situational awareness

and other humanitarian disaster response efforts, if processed

timely and effectively. Processing social media information pose

multiple challenges such as parsing noisy, brief and informal

messages, learning information categories from the incoming

stream of messages and classifying them into different classes

among others. One of the basic necessities of many of these tasks

is the availability of data, in particular human-annotated data. In

this paper, we present human-annotated Twitter corpora collected

during 19 different crises that took place between 2013 and 2015.

To demonstrate the utility of the annotations, we train machine

learning classifiers. Moreover, we publish the first and largest word2vec word embeddings trained on 52 million crisis-related tweets. To deal with the language issues of tweets, we present human-annotated

normalized lexical resources for different lexical variations.
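
As a hedged illustration of how such embeddings might be trained (not the authors' actual setup), the following Python sketch uses the gensim 4.x Word2Vec API on a file of normalized tweets; the file name, tokenization and hyperparameters are assumptions.

    # Hedged sketch of training word2vec embeddings on a tweet collection with
    # gensim (4.x API); file name, tokenization and parameters are illustrative,
    # not the settings used for the published embeddings.
    from gensim.models import Word2Vec

    def read_tweets(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                # Whitespace tokenization of already-normalized tweet text.
                yield line.strip().lower().split()

    sentences = list(read_tweets("crisis_tweets_normalized.txt"))
    model = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=4)
    model.save("crisis_word2vec.model")
    print(model.wv.most_similar("earthquake", topn=5))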

Functions of Code-Switching in Tweets: An Annotation Framework and Some Initial Experiments

Rafiya Begum, Kalika Bali, Monojit Choudhury, Koustav Rudra and Niloy Ganguly

Code-Switching (CS) between two languages is extremely

common in communities with societal multilingualism where

speakers switch between two or more languages when interacting

with each other. CS has been extensively studied in

spoken language by linguists for several decades but with the

popularity of social-media and less formal Computer Mediated

Communication, we now see a big rise in the use of CS in

the text form. This poses interesting challenges and a need

for computational processing of such code-switched data. As

with any Computational Linguistic analysis and Natural Language

Processing tools and applications, we need annotated data for

understanding, processing, and generation of code-switched

language. In this study, we focus on CS between English and

Hindi Tweets extracted from the Twitter stream of Hindi-English

bilinguals. We present an annotation scheme for annotating

the pragmatic functions of CS in Hindi-English (Hi-En) code-

switched tweets based on a linguistic analysis and some initial

experiments.

O19 - Dependency Treebanks

Thursday, May 26, 9:45

Chairperson: Simonetta Montemagni Oral Session

Universal Dependencies for Japanese

Takaaki Tanaka, Yusuke Miyao, Masayuki Asahara, Sumire Uematsu, Hiroshi Kanayama, Shinsuke Mori and Yuji Matsumoto

We present an attempt to port the international syntactic

annotation scheme, Universal Dependencies, to the Japanese

language in this paper. Since the Japanese syntactic structure

is usually annotated on the basis of unique chunk-based

dependencies, we first introduce word-based dependencies by

using a word unit called the Short Unit Word, which usually

corresponds to an entry in the lexicon UniDic. Porting is done

by mapping the part-of-speech tagset in UniDic to the universal

part-of-speech tagset, and converting a constituent-based treebank

to a typed dependency tree. The conversion is not straightforward,

and we discuss the problems that arose in the conversion and the

current solutions. A treebank consisting of 10,000 sentences was

built by converting the existing resources and has been released to the public.

Universal Dependencies v1: A Multilingual Treebank Collection

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty and Daniel Zeman

Cross-linguistically consistent annotation is necessary for sound

comparative evaluation and cross-lingual learning experiments.

It is also useful for multilingual system development and

comparative linguistic studies. Universal Dependencies is an

open community effort to create cross-linguistically consistent

treebank annotation for many languages within a dependency-

based lexicalist framework. In this paper, we describe v1 of the

universal guidelines, the underlying design principles, and the

currently available treebanks for 33 languages.

Construction of an English Dependency Corpus incorporating Compound Function Words

Akihiko Kato, Hiroyuki Shindo and Yuji Matsumoto

The recognition of multiword expressions (MWEs) in a sentence

is important for such linguistic analyses as syntactic and semantic

parsing, because it is known that combining an MWE into a

single token improves accuracy for various NLP tasks, such

as dependency parsing and constituency parsing. However,

MWEs are not annotated in Penn Treebank. Furthermore, when

converting word-based dependency to MWE-aware dependency

directly, one could combine nodes in an MWE into a single node.

Nevertheless, this method often leads to the following problem:

A node derived from an MWE could have multiple heads and

the whole dependency structure including MWE might be cyclic.

Therefore we converted a phrase structure to a dependency

structure after establishing an MWE as a single subtree. This

approach can avoid an occurrence of multiple heads and/or cycles.

In this way, we constructed an English dependency corpus taking

into account compound function words, which are one type of

MWEs that serve as functional expressions. In addition, we report

experimental results of dependency parsing using a constructed

corpus.

Adapting the TANL tool suite to Universal Dependencies

Maria Simi and Giuseppe Attardi

TANL is a suite of tools for text analytics based on the software

architecture paradigm of data driven pipelines. The strategies

for upgrading TANL to the use of Universal Dependencies range

from a minimalistic approach consisting of introducing pre/post-

processing steps into the native pipeline to revising the whole

pipeline. We explore the issue in the context of the Italian

Treebank, considering both the efforts involved, how to avoid

losing linguistically relevant information and the loss of accuracy

in the process. In particular we compare different strategies for

parsing and discuss the implications of simplifying the pipeline

when detailed part-of-speech and morphological annotations are

not available, as it is the case for less resourceful languages. The

experiments are relative to the Italian linguistic pipeline, but the

use of different parsers in our evaluations and the avoidance of

language specific tagging make the results general enough to be

useful in helping the transition to UD for other languages.

A Dependency Treebank of the Chinese Buddhist Canon

Tak-sum Wong and John Lee

We present a dependency treebank of the Chinese Buddhist

Canon, which contains 1,514 texts with about 50 million Chinese

characters. The treebank was created by an automatic parser

trained on a smaller treebank, containing four manually annotated

sutras (Lee and Kong, 2014). We report results on word

segmentation, part-of-speech tagging and dependency parsing,

and discuss challenges posed by the processing of medieval

Chinese. In a case study, we exploit the treebank to examine verbs

frequently associated with Buddha, and to analyze usage patterns

of quotative verbs in direct speech. Our results suggest that certain

quotative verbs imply status differences between the speaker and

the listener.

O20 - Word Sense Disambiguation

Thursday, May 26, 9:45

Chairperson: Nancy Ide Oral Session

Automatic Biomedical Term Polysemy Detection

Juan Antonio Lossio-Ventura, Clement Jonquet, Mathieu Roche and Maguelonne Teisseire

Polysemy is the capacity for a word to have multiple meanings.

Polysemy detection is a first step for Word Sense Induction

(WSI), which makes it possible to find different meanings for a term. Polysemy detection is also important for information extraction (IE) systems, as well as for building and enriching terminologies and ontologies. In this paper,

we present a novel approach to detect if a biomedical term

is polysemic, with the long term goal of enriching biomedical

ontologies. This approach is based on the extraction of new

features. In this context, we propose to extract features in two ways: (i) directly from the text dataset, and (ii) from an induced graph. Our method obtains an Accuracy and F-

Measure of 0.978.

Cro36WSD: A Lexical Sample for Croatian Word Sense Disambiguation

Domagoj Alagic and Jan Šnajder

We introduce Cro36WSD, a freely-available medium-sized

lexical sample for Croatian word sense disambiguation

(WSD). Cro36WSD comprises 36 words: 12 adjectives, 12 nouns,

and 12 verbs, balanced across both frequency bands and polysemy

levels. We adopt the multi-label annotation scheme in the hope

of lessening the drawbacks of discrete sense inventories and

obtaining more realistic annotations from human experts. Sense-

annotated data is collected through multiple annotation rounds to

ensure high-quality annotations: with a 115 person-hours effort

we reached an inter-annotator agreement score of 0.877. We

analyze the obtained data and perform a correlation analysis

between several relevant variables, including word frequency,

number of senses, sense distribution skewness, average annotation

time, and the observed inter-annotator agreement (IAA). Using

the obtained data, we compile multi- and single-labeled dataset

variants using different label aggregation schemes. Finally, we

evaluate three different baseline WSD models on both dataset

variants and report on the insights gained. We make both dataset

variants freely available.

Addressing the MFS Bias in WSD systems

Marten Postma, Ruben Izquierdo, Eneko Agirre, German Rigau and Piek Vossen

Word Sense Disambiguation (WSD) systems tend to have a

strong bias towards assigning the Most Frequent Sense (MFS),

which results in high performance on the MFS but in a very

low performance on the less frequent senses. We addressed the

MFS bias in WSD systems by combining the output from a

WSD system with a set of mostly static features to create a MFS

classifier to decide when to and not to choose the MFS. The output

from this MFS classifier, which is based on the Random Forest

algorithm, is then used to modify the output from the original

WSD system. We applied our classifier to one of the state-of-the-

art supervised WSD systems, i.e. IMS, and to one of the best state-of-the-art unsupervised WSD systems, i.e. UKB. Our main finding is

that we are able to improve the system output in terms of choosing

between the MFS and the less frequent senses. When we apply the

MFS classifier to fine-grained WSD, we observe an improvement

on the less frequent sense cases, whereas we maintain the overall

recall.
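
The back-off idea can be sketched as follows; this is a minimal, hypothetical Python example using scikit-learn's Random Forest, with invented features and training data rather than the authors' actual feature set.

    # A minimal sketch of an MFS back-off classifier: a Random Forest decides,
    # from a few mostly static features, whether to keep the WSD system's sense
    # or fall back to the Most Frequent Sense. Features and data are invented.
    from sklearn.ensemble import RandomForestClassifier

    # Each row: [sense_entropy, n_senses, wsd_confidence, is_mfs_predicted]
    X_train = [[0.2, 2, 0.9, 1],
               [1.5, 6, 0.4, 0],
               [0.8, 3, 0.7, 1],
               [1.9, 8, 0.3, 0]]
    y_train = [1, 0, 1, 0]  # 1 = choose the MFS, 0 = keep the system's sense

    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

    def final_sense(features, system_sense, mfs_sense):
        # Override the WSD output with the MFS only when the classifier says so.
        return mfs_sense if clf.predict([features])[0] == 1 else system_sense

    print(final_sense([1.7, 7, 0.35, 0], "bank.n.02", "bank.n.01"))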

A Large-Scale Multilingual Disambiguation of Glosses

José Camacho-Collados, Claudio Delli Bovi, Alessandro Raganato and Roberto Navigli

Linking concepts and named entities to knowledge bases has

become a crucial Natural Language Understanding task. In

this respect, recent works have shown the key advantage

of exploiting textual definitions in various Natural Language

Processing applications. However, to date there are no reliable

large-scale corpora of sense-annotated textual definitions available

to the research community. In this paper we present a large-

scale high-quality corpus of disambiguated glosses in multiple

languages, comprising sense annotations of both concepts and

named entities from a unified sense inventory. Our approach

for the construction and disambiguation of the corpus builds

upon the structure of a large multilingual semantic network

and a state-of-the-art disambiguation system; first, we gather

complementary information of equivalent definitions across

different languages to provide context for disambiguation, and

then we combine it with a semantic similarity-based refinement.

As a result we obtain a multilingual corpus of textual definitions

featuring over 38 million definitions in 263 languages, and

we make it freely available at http://lcl.uniroma1.it/disambiguated-glosses. Experiments on Open

Information Extraction and Sense Clustering show how two state-

of-the-art approaches improve their performance by integrating

our disambiguated corpus into their pipeline.

Unsupervised Ranked Cross-Lingual Lexical Substitution for Low-Resource Languages

Stefan Ecker, Andrea Horbach and Stefan Thater

We propose an unsupervised system for a variant of cross-lingual

lexical substitution (CLLS) to be used in a reading scenario in

computer-assisted language learning (CALL), in which single-

word translations provided by a dictionary are ranked according

to their appropriateness in context. In contrast to most alternative

systems, ours does not rely on either parallel corpora or machine

translation systems, making it suitable for low-resource languages

as the language to be learned. This is achieved by a graph-based

scoring mechanism which can deal with ambiguous translations

of context words provided by a dictionary. Due to this

decoupling from the source language, we need monolingual

corpus resources only for the target language, i.e. the language

of the translation candidates. We evaluate our approach for

the language pair Norwegian Nynorsk-English on an exploratory

manually annotated gold standard and report promising results.

When running our system on the original SemEval CLLS task,

we rank 6th out of 18 (including 2 baselines and our 2 system

variants) in the best evaluation.

P19 - Discourse (2), Thursday, May 26, 9:45

Chairperson: Olga Uryupina Poster Session

Information structure in the Potsdam Commentary Corpus: Topics

Manfred Stede and Sara Mamprin

The Potsdam Commentary Corpus is a collection of 175 German

newspaper commentaries annotated on a variety of different

layers. This paper introduces a new layer that covers the linguistic

notion of information-structural topic (not to be confused with

‘topic’ as applied to documents in information retrieval). To our

knowledge, this is the first larger topic-annotated resource for

German (and one of the first for any language). We describe the

annotation guidelines and the annotation process, and the results

of an inter-annotator agreement study, which compare favourably

to the related work. The annotated corpus is freely available for

research.

A Corpus of Clinical Practice Guidelines Annotated with the Importance of Recommendations

Jonathon Read, Erik Velldal, Marc Cavazza and Gersende Georg

In this paper we present the Corpus of REcommendation

STrength (CREST), a collection of HTML-formatted clinical

guidelines annotated with the location of recommendations.

Recommendations are labelled with an author-provided indicator

of their strength of importance. As data was drawn from many

disparate authors, we define a unified scheme of importance

labels, and provide a mapping for each guideline. We

demonstrate the utility of the corpus and its annotations in

some initial measurements investigating the type of language

constructions associated with strong and weak recommendations,

and experiments into promising features for recommendation

classification, both with respect to strong and weak labels, and to

all labels of the unified scheme. An error analysis indicates that,

while there is a strong relationship between lexical choices and

strength labels, there can be substantial variance in the choices

made by different authors.

The Methodius Corpus of Rhetorical Discourse Structures and Generated Texts

Amy Isard

Using the Methodius Natural Language Generation (NLG)

System, we have created a corpus which consists of a collection

of generated texts which describe ancient Greek artefacts. Each

text is linked to two representations created as part of the NLG

process. The first is a content plan, which uses rhetorical relations

to describe the high-level discourse structure of the text, and the

second is a logical form describing the syntactic structure, which

is sent to the OpenCCG surface realization module to produce

the final text output. In recent work, White and Howcroft (2015)

have used the SPaRKy restaurant corpus, which contains a similar

combination of texts and representations, for their research on the

induction of rules for the combination of clauses. In the first

instance this corpus will be used to test their algorithms on an

additional domain, and extend their work to include the learning

of referring expression generation rules. As far as we know, the

SPaRKy restaurant corpus is the only existing corpus of this type,

and we hope that the creation of this new corpus in a different

domain will provide a useful resource to the Natural Language

Generation community.

Applying Core Scientific Concepts to Context-Based Citation Recommendation

Daniel Duma, Maria Liakata, Amanda Clare, James Ravenscroft and Ewan Klein

The task of recommending relevant scientific literature for a draft

academic paper has recently received significant interest. In our

effort to ease the discovery of scientific literature and augment

scientific writing, we aim to improve the relevance of results

based on a shallow semantic analysis of the source document

and the potential documents to recommend. We investigate the

utility of automatic argumentative and rhetorical annotation of

documents for this purpose. Specifically, we integrate automatic

Core Scientific Concepts (CoreSC) classification into a prototype

context-based citation recommendation system and investigate its

usefulness to the task. We frame citation recommendation as

an information retrieval task and we use the categories of the

annotation schemes to apply different weights to the similarity

formula. Our results show interesting and consistent correlations

between the type of citation and the type of sentence containing

the relevant information.

SciCorp: A Corpus of English Scientific Articles Annotated for Information Status Analysis

Ina Roesiger

This paper presents SciCorp, a corpus of full-text English

scientific papers of two disciplines, genetics and computational

linguistics. The corpus comprises co-reference and bridging

information as well as information status labels. Since SciCorp

is annotated with both labels and the respective co-referent and

bridging links, we believe it is a valuable resource for NLP

researchers working on scientific articles or on applications such

as co-reference resolution, bridging resolution or information

status classification. The corpus has been reliably annotated

by independent human coders with moderate inter-annotator

agreement (average kappa = 0.71). In total, we have annotated

14 full papers containing 61,045 tokens and marked 8,708 definite

noun phrases. The paper describes in detail the annotation scheme

as well as the resulting corpus. The corpus is available for

download in two different formats: in an offset-based format

and for the co-reference annotations in the widely-used, tabular

CoNLL-2012 format.

Using Lexical and Dependency Features to Disambiguate Discourse Connectives in Hindi

Rohit Jain, Himanshu Sharma and Dipti Sharma

Discourse parsing is a challenging task in NLP and plays a

crucial role in discourse analysis. To enable discourse analysis

for Hindi, Hindi Discourse Relations Bank was created on a

subset of Hindi TreeBank. The benefits of a discourse analyzer

in automated discourse analysis, question summarization and

question answering domains has motivated us to begin work

on a discourse analyzer for Hindi. In this paper, we focus on

discourse connective identification for Hindi. We explore various

available syntactic features for this task. We also explore the

use of dependency tree parses present in the Hindi TreeBank

and study the impact of the same on the performance of the

system. We report that the novel dependency features introduced

have a higher impact on precision, in comparison to the syntactic

features previously used for this task. In addition, we report a high

accuracy of 96% for this task.

Annotating Topic Development in Information Seeking Queries

Marta Andersson, Adnan Ozturel and Silvia Pareti

This paper contributes to the limited body of empirical research in

the domain of discourse structure of information seeking queries.

We describe the development of an annotation schema for coding

topic development in information seeking queries and the initial

observations from a pilot sample of query sessions. The main

idea that we explore is the relationship between constant and

variable discourse entities and their role in tracking changes in the

topic progression. We argue that the topicalized entities remain

stable across development of the discourse and can be identified

by a simple mechanism where anaphora resolution is a precursor.

We also claim that a corpus annotated in this framework can be

used as training data for dialogue management and computational

semantics systems.

Searching in the Penn Discourse Treebank Using the PML-Tree Query

Jirí Mírovský, Lucie Poláková and Jan Štepánek

The PML-Tree Query is a general, powerful and user-friendly

system for querying richly linguistically annotated treebanks.

The paper shows how the PML-Tree Query can be used for

searching for discourse relations in the Penn Discourse Treebank

2.0 mapped onto the syntactic annotation of the Penn Treebank.

The OpenCourseWare Metadiscourse (OCWMD) Corpus

Ghada Alharbi and Thomas Hain

This study describes a new corpus of over 60,000 hand-annotated

metadiscourse acts from 106 OpenCourseWare lectures, from two

different disciplines: Physics and Economics. Metadiscourse is a

set of linguistic expressions that signal different functions in the

discourse. This type of language is hypothesised to be helpful in

finding a structure in unstructured text, such as lecture discourse.

A brief summary is provided about the annotation scheme and

labelling procedures, inter-annotator reliability statistics, overall

distributional statistics, a description of auxiliary data that will

be distributed with the corpus, and information relating to how

to obtain the data. The results provide a deeper understanding of

lecture structure and confirm the reliable coding of metadiscursive

acts in academic lectures across different disciplines. The next

stage of our research will be to build a classification model to

automate the tagging process, instead of manual annotation, which

takes time and effort. This is in addition to the use of these tags as

indicators of the higher level structure of lecture discourse.

Ubuntu-fr: A Large and Open Corpus for Multi-modal Analysis of Online Written Conversations

Nicolas Hernandez, Soufian Salim and Elizaveta Loginova Clouet

We present a large, free, French corpus of online written

conversations extracted from the Ubuntu platform’s forums,

mailing lists and IRC channels. The corpus is meant to

support multi-modality and diachronic studies of online written

conversations. We choose to build the corpus around a robust

metadata model based upon strong principles, such as the “stand

off” annotation principle. We detail the model, we explain how

the data was collected and processed - in terms of meta-data, text

and conversation - and we detail the corpus’ contents through a

series of meaningful statistics. A portion of the corpus - about

4,700 sentences from emails, forum posts and chat messages

sent in November 2014 - is annotated in terms of dialogue acts

and sentiment. We discuss how we adapted our dialogue act

taxonomy from the DIT++ annotation scheme and how the data

was annotated, before presenting our results as well as a brief

qualitative analysis of the annotated data.

DUEL: A Multi-lingual Multimodal Dialogue Corpus for Disfluency, Exclamations and Laughter

Julian Hough, Ye Tian, Laura de Ruiter, Simon Betz, Spyros Kousidis, David Schlangen and Jonathan Ginzburg

We present the DUEL corpus, consisting of 24 hours of natural,

face-to-face, loosely task-directed dialogue in German, French

and Mandarin Chinese. The corpus is uniquely positioned as

a cross-linguistic, multimodal dialogue resource controlled for

domain. DUEL includes audio, video and body tracking data

and is transcribed and annotated for disfluency, laughter and

exclamations.

P20 - Document Classification and Text Categorisation (1), Thursday, May 26, 9:45

Chairperson: Fabio Tamburini Poster Session

Character-Level Neural Translation for Multilingual Media Monitoring in the SUMMA Project

Guntis Barzdins, Steve Renals and Didzis Gosko

The paper steps outside the comfort-zone of the traditional NLP

tasks like automatic speech recognition (ASR) and machine

translation (MT) to address two novel problems arising in the

automated multilingual news monitoring: segmentation of the

TV and radio program ASR transcripts into individual stories,

and clustering of the individual stories coming from various

sources and languages into storylines. Storyline clustering

of stories covering the same events is an essential task for

inquisitorial media monitoring. We address these two problems

jointly by engaging the low-dimensional semantic representation

capabilities of the sequence to sequence neural translation

models. To enable joint multi-task learning for multilingual

neural translation of morphologically rich languages we replace

the attention mechanism with the sliding-window mechanism

and operate the sequence to sequence neural translation model

on the character-level rather than on the word-level. The

story segmentation and storyline clustering problem is tackled

by examining the low-dimensional vectors produced as a side-

product of the neural translation process. The results of this paper

describe a novel approach to the automatic story segmentation and

storyline clustering problem.

Exploring the Realization of Irony in Twitter Data

Cynthia Van Hee, Els Lefever and Veronique Hoste

Handling figurative language like irony is currently a challenging

task in natural language processing. Since irony is commonly

used in user-generated content, its presence can significantly

undermine accurate analysis of opinions and sentiment in such

texts. Understanding irony is therefore important if we want to

push the state-of-the-art in tasks such as sentiment analysis. In this

research, we present the construction of a Twitter dataset for two

languages, being English and Dutch, and the development of new

guidelines for the annotation of verbal irony in social media texts.

Furthermore, we present some statistics on the annotated corpora,

from which we can conclude that the detection of contrasting

evaluations might be a good indicator for recognizing irony.

Discriminating Similar Languages: Evaluationsand Explorations

Cyril Goutte, Serge Léger, Shervin Malmasi and Marcos Zampieri

We present an analysis of the performance of machine learning

classifiers on discriminating between similar languages and

language varieties. We carried out a number of experiments using

the results of the two editions of the Discriminating between

Similar Languages (DSL) shared task. We investigate the progress

made between the two tasks, estimate an upper bound on possible

performance using ensemble and oracle combination, and provide

learning curves to help us understand which languages are more

challenging. A number of difficult sentences are identified and

investigated further with human annotation.

Compilation of an Arabic Children’s Corpus

Latifa Al-Sulaiti, Noorhan Abbas, Claire Brierley, Eric Atwell and Ayman Alghamdi

Inspired by the Oxford Children’s Corpus, we have developed

a prototype corpus of Arabic texts written and/or selected for

children. Our Arabic Children’s Corpus of 2950 documents and

nearly 2 million words has been collected manually from the web

during a 3-month project. It is of high quality, and contains a range

of different children’s genres based on sources located, including

classic tales from The Arabian Nights, and popular fictional

characters such as Goha. We anticipate that the current and

subsequent versions of our corpus will lead to interesting studies

in text classification, language use, and ideology in children’s

texts.

Quality Assessment of the Reuters Vol. 2 Multilingual Corpus

Robin Eriksson

We introduce a framework for quality assurance of corpora, and

apply it to the Reuters Multilingual Corpus (RCV2). The results

of this quality assessment of this standard newsprint corpus reveal

a significant duplication problem and, to a lesser extent, a problem

with corrupted articles. From the raw collection of some 487,000

articles, almost one tenth are trivial duplicates. A smaller fraction

of articles appear to be corrupted and should be excluded for

that reason. The detailed results are being made available as

on-line appendices to this article. This effort also demonstrates

the beginnings of a constraint-based methodological framework

for quality assessment and quality assurance for corpora. As

a first implementation of this framework, we have investigated

constraints to verify sample integrity, and to diagnose sample

duplication, entropy aberrations, and tagging inconsistencies. To

help identify near-duplicates in the corpus, we have employed

both entropy measurements and a simple byte bigram incidence

digest.
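As an illustration of the duplicate and corruption checks described, the sketch below combines a byte-bigram incidence digest (compared with Jaccard similarity) with a Shannon-entropy check; it is a toy reconstruction, not the authors' implementation.

```python
# Minimal sketch in the spirit of the described checks (not the authors' code):
# a byte-bigram incidence digest to flag near-duplicate articles, and a Shannon
# entropy measure to flag corrupted or degenerate samples.
import math
from collections import Counter

def bigram_digest(text: str) -> frozenset:
    """Set of byte bigrams occurring in the article (incidence, not counts)."""
    data = text.encode("utf-8")
    return frozenset(data[i:i + 2] for i in range(len(data) - 1))

def jaccard(a: frozenset, b: frozenset) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def byte_entropy(text: str) -> float:
    """Shannon entropy (bits per byte); unusually low values suggest corruption."""
    data = text.encode("utf-8")
    counts = Counter(data)
    return -sum((c / len(data)) * math.log2(c / len(data)) for c in counts.values())

doc1 = "Reuters reports that markets rallied strongly on Tuesday."
doc2 = "Reuters reports that markets rallied strongly on Wednesday."
print(jaccard(bigram_digest(doc1), bigram_digest(doc2)))  # close to 1 => near-duplicate
print(byte_entropy(doc1))
```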

Learning Tone and Attribution for Financial Text Mining

Mahmoud El-Haj, Paul Rayson, Steve Young, Andrew Moore, Martin Walker, Thomas Schleicher and Vasiliki Athanasakou

Attribution bias refers to the tendency of people to attribute

successes to their own abilities but failures to external factors. In

a business context an internal factor might be the restructuring of

the firm and an external factor might be an unfavourable change

in exchange or interest rates. In accounting research, the presence

of an attribution bias has been demonstrated for the narrative

sections of the annual financial reports. Previous studies have

applied manual content analysis to this problem but in this paper

we present novel work to automate the analysis of attribution bias

through using machine learning algorithms. Previous studies have

only applied manual content analysis on a small scale to reveal

such a bias in the narrative section of annual financial reports. In

our work a group of experts in accounting and finance labelled

and annotated a list of 32,449 sentences from a random sample of

UK Preliminary Earning Announcements (PEAs) to allow us to

examine whether sentences in PEAs contain internal or external

attribution and which kinds of attributions are linked to positive

or negative performance. We wished to examine whether human

annotators could agree on coding this difficult task and whether

Machine Learning (ML) could be applied reliably to replicate the

coding process on a much larger scale. Our best machine learning

algorithm correctly classified performance sentences with 70%

accuracy and detected tone and attribution in financial PEAs with

accuracy of 79%.

A Comparative Study of Text Preprocessing Approaches for Topic Detection of User Utterances

Roman Sergienko, Muhammad Shan and Wolfgang Minker

The paper describes a comparative study of existing and novel text

preprocessing and classification techniques for domain detection

of user utterances. Two corpora are considered. The first

one contains customer calls to a call centre for further call

routing; the second one contains answers of call centre employees

with different kinds of customer orientation behaviour. Seven

different unsupervised and supervised term weighting methods

were applied. The collective use of term weighting methods

is proposed for classification effectiveness improvement. Four

different dimensionality reduction methods were applied: stop-

words filtering with stemming, feature selection based on term

weights, feature transformation based on term clustering, and a

novel feature transformation method based on terms belonging to

classes. As classification algorithms we used k-NN and a SVM-

based algorithm. The numerical experiments have shown that the

simultaneous use of the novel proposed approaches (collectives

of term weighting methods and the novel feature transformation

method) allows high classification performance to be reached with

a very small number of features.
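One possible configuration of such a pipeline is sketched below: tf-idf term weighting, feature selection (chi-square is used here as a stand-in) and a k-NN classifier over toy utterances. It only illustrates the kind of setup evaluated, not the paper's collective term-weighting scheme or its novel feature transformation.

```python
# Minimal sketch of one configuration from the kind of pipeline described
# (not the paper's exact method): tf-idf weighting, chi2-based feature
# selection, and a k-NN classifier. Utterances and domains are toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

utterances = ["I want to check my invoice",
              "my internet connection is down",
              "please send me the invoice copy",
              "the router keeps disconnecting"]
domains = ["billing", "technical", "billing", "technical"]

pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    SelectKBest(chi2, k=5),          # keep only a handful of term features
    KNeighborsClassifier(n_neighbors=1),
)
pipeline.fit(utterances, domains)
print(pipeline.predict(["router is down again"]))
```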

UPPC - Urdu Paraphrase Plagiarism Corpus

Muhammad Sharjeel, Paul Rayson and Rao Muhammad Adeel Nawab

Paraphrase plagiarism is a significant and widespread problem

and research shows that it is hard to detect. Several methods and

automatic systems have been proposed to deal with it. However,

evaluation and comparison of such solutions is not possible

because of the unavailability of benchmark corpora with manual

examples of paraphrase plagiarism. To deal with this issue, we

present the novel development of a paraphrase plagiarism corpus

containing simulated (manually created) examples in the Urdu

language - a language widely spoken around the world. This

resource is the first of its kind developed for the Urdu language and

we believe that it will be a valuable contribution to the evaluation

of paraphrase plagiarism detection systems.

Identifying Content Types of Messages Related to Open Source Software Projects

Yannis Korkontzelos, Paul Thompson and Sophia Ananiadou

Assessing the suitability of an Open Source Software project for

adoption requires not only an analysis of aspects related to the

code, such as code quality, frequency of updates and new version

releases, but also an evaluation of the quality of support offered

in related online forums and issue trackers. Understanding the

content types of forum messages and issue trackers can provide

information about the extent to which requests are being addressed

and issues are being resolved, the percentage of issues that are

not being fixed, the cases where the user acknowledged that the

issue was successfully resolved, etc. These indicators can provide

potential adopters of the OSS with estimates about the level of

available support. We present a detailed hierarchy of content

types of online forum messages and issue tracker comments

and a corpus of messages annotated accordingly. We discuss

our experiments to classify forum messages and issue tracker

comments into content-related classes, i.e. to assign them to nodes

of the hierarchy. The results are very encouraging.

Emotion Corpus Construction Based on Selection from Hashtags

Minglei Li, Yunfei Long, Lu Qin and Wenjie Li

The availability of labelled corpus is of great importance for

supervised learning in emotion classification tasks. Because it is

time-consuming to manually label text, hashtags have been used

as naturally annotated labels to obtain a large amount of labelled

training data from microblog. However, natural hashtags contain

too much noise for it to be used directly in learning algorithms.

In this paper, we design a three-stage semi-automatic method to

construct an emotion corpus from microblogs. Firstly, a lexicon

based voting approach is used to verify the hashtag automatically.

Secondly, a SVM based classifier is used to select the data whose

natural labels are consistent with the predicted labels. Finally,

the remaining data will be manually examined to filter out the

noisy data. Out of about 48K filtered Chinese microblogs, 39K

microblogs are selected to form the final corpus with the Kappa

value reaching over 0.92 for the automatic parts and over 0.81 for

the manual part. The proportion of automatic selection reaches

54.1%. Thus, the method can reduce about 44.5% of manual

workload for acquiring quality data. Experiment on a classifier

trained on this corpus shows that it achieves comparable results

compared to the manually annotated NLP&CC2013 corpus.
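The lexicon-based voting step can be illustrated as follows: a hashtag label is kept only when the emotion words found in the text vote for the same emotion. The tiny lexicon and example posts are invented; this is only a sketch of the idea, not the authors' system.

```python
# Minimal sketch of a lexicon-based voting check (not the authors' exact system):
# keep a hashtag label only if the emotion words found in the text "vote" for the
# same emotion. The tiny lexicon and example posts below are invented.
from collections import Counter

EMOTION_LEXICON = {            # hypothetical word -> emotion mapping
    "happy": "joy", "delighted": "joy", "great": "joy",
    "furious": "anger", "annoyed": "anger",
    "terrified": "fear", "scared": "fear",
}

def lexicon_vote(text):
    votes = Counter(EMOTION_LEXICON[w] for w in text.lower().split()
                    if w in EMOTION_LEXICON)
    return votes.most_common(1)[0][0] if votes else None

def hashtag_is_verified(text, hashtag_label):
    return lexicon_vote(text) == hashtag_label

print(hashtag_is_verified("so happy and delighted today", "joy"))     # True
print(hashtag_is_verified("stuck in traffic and furious", "joy"))     # False
```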

P21 - Evaluation Methodologies (2), Thursday, May 26, 9:45

Chairperson: António Branco Poster Session

Comparing the Level of Code-Switching in Corpora

Björn Gambäck and Amitava Das

Social media texts are often fairly informal and conversational,

and when produced by bilinguals tend to be written in

several different languages simultaneously, in the same way as

conversational speech. The recent availability of large social

media corpora has thus also made large-scale code-switched

resources available for research. The paper addresses the issues of

evaluation and comparison these new corpora entail, by defining

an objective measure of corpus level complexity of code-switched

texts. It is also shown how this formal measure can be used in

practice, by applying it to several code-switched corpora.
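As an illustration only, the sketch below computes a simple utterance-level mixing score (one minus the share of the dominant language, ignoring language-neutral tokens) and averages it over a corpus; the corpus-level complexity measure actually defined in the paper may differ.

```python
# Illustrative sketch only: a simple code-mixing score in the spirit of such
# complexity measures (not necessarily the exact formula defined in the paper).
# Tokens are tagged with hypothetical language labels.
from collections import Counter

def mixing_score(lang_tags: list[str]) -> float:
    """1 - (share of the dominant language); 0 for monolingual utterances."""
    counts = Counter(t for t in lang_tags if t != "other")  # ignore neutral tokens
    n = sum(counts.values())
    return 0.0 if n == 0 else 1.0 - max(counts.values()) / n

utterance = ["en", "en", "hi", "hi", "en", "other"]   # e.g. English/Hindi tweet
print(f"mixing score: {mixing_score(utterance):.2f}")  # 0.40

def corpus_score(tagged_utterances: list[list[str]]) -> float:
    return sum(mixing_score(u) for u in tagged_utterances) / len(tagged_utterances)
```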

Evaluation of the KIT Lecture Translation System

Markus Müller, Sarah Fünfer, Sebastian Stüker and Alex Waibel

To attract foreign students is among the goals of the Karlsruhe

Institute of Technology (KIT). One obstacle to achieving this goal

is that lectures at KIT are usually held in German, a language in

which many foreign students are not sufficiently proficient, as

opposed to, e.g., English. While the students from abroad are learning German

during their stay at KIT, it is challenging to become proficient

enough in it in order to follow a lecture. As a solution to this

problem we offer our automatic simultaneous lecture translation.

It translates German lectures into English in real time. While

not as good as human interpreters, the system is available at a

price that KIT can afford in order to offer it in potentially all

lectures. In order to assess the quality of the system, we

have conducted a user study. In this paper we present this study,

the way it was conducted and its results. The results indicate that

the quality of the system has passed a threshold as to be able to

support students in their studies. The study has helped to identify

the most crucial weaknesses of the system and has guided us on

which steps to take next.

The ACL RD-TEC 2.0: A Language Resource for Evaluating Term Extraction and Entity Recognition Methods

Behrang QasemiZadeh and Anne-Kathrin Schumann

This paper introduces the ACL Reference Dataset for Terminology

Extraction and Classification, version 2.0 (ACL RD-TEC 2.0).

The ACL RD-TEC 2.0 has been developed with the aim of

providing a benchmark for the evaluation of term and entity

recognition tasks based on specialised text from the computational

linguistics domain. This release of the corpus consists of 300

abstracts from articles in the ACL Anthology Reference Corpus,

published between 1978–2006. In these abstracts, terms (i.e.,

single or multi-word lexical units with a specialised meaning)

are manually annotated. In addition to their boundaries in

running text, annotated terms are classified into one of the seven

categories method, tool, language resource (LR), LR product,

model, measures and measurements, and other. To assess the

quality of the annotations and to determine the difficulty of this

annotation task, more than 171 of the abstracts are annotated

twice, independently, by each of the two annotators. In total, 6,818

terms are identified and annotated in more than 1300 sentences,

resulting in a specialised vocabulary made of 3,318 lexical forms,

mapped to 3,471 concepts. We explain the development of the

annotation guidelines and discuss some of the challenges we

encountered in this annotation task.

Building an Arabic Machine Translation Post-Edited Corpus: Guidelines and Annotation

Wajdi Zaghouani, Nizar Habash, Ossama Obeid, Behrang Mohit, Houda Bouamor and Kemal Oflazer

We present our guidelines and annotation procedure to create a

human-corrected, machine-translated post-edited corpus for

Modern Standard Arabic. Our overarching goal is to use the

annotated corpus to develop automatic machine translation post-

editing systems for Arabic that can be used to help accelerate

the human revision process of translated texts. The creation of

any manually annotated corpus usually presents many challenges.

In order to address these challenges, we created comprehensive

and simplified annotation guidelines which were used by a team

of five annotators and one lead annotator. In order to ensure

a high annotation agreement between the annotators, multiple

training sessions were held and regular inter-annotator agreement

measures were performed to check the annotation quality. The

created corpus of manual post-edited translations of English to

Arabic articles is the largest to date for this language pair.

Tools and Guidelines for Principled Machine Translation Development

Nora Aranberri, Eleftherios Avramidis, Aljoscha Burchardt, Ondrej Klejch, Martin Popel and Maja Popovic

This work addresses the need to aid Machine Translation (MT)

development cycles with a complete workflow of MT evaluation

methods. Our aim is to assess, compare and improve MT

system variants. We hereby report on novel tools and practices

that support various measures, developed in order to support

a principled and informed approach of MT development. Our

toolkit for automatic evaluation showcases quick and detailed

comparison of MT system variants through automatic metrics and

n-gram feedback, along with manual evaluation via edit-distance,

error annotation and task-based feedback.

Generating Task-Pertinent sorted Error Lists for Speech Recognition

Olivier Galibert, Mohamed Ameur Ben Jannet, Juliette Kahn and Sophie Rosset

Automatic Speech recognition (ASR) is one of the most widely

used components in spoken language processing applications.

ASR errors are of varying importance with respect to the

application, making error analysis key to improving speech

processing applications. Knowing the most serious errors for

the applicative case is critical to build better systems. In the

context of Automatic Speech Recognition (ASR) used as a first

step towards Named Entity Recognition (NER) in speech, error

seriousness is usually determined by their frequency, due to the

use of WER as the metric to evaluate the ASR output, despite the

emergence of more relevant measures in the literature. We propose

to use a different evaluation metric from the literature in order to

classify ASR errors according to their seriousness for NER. Our

results show that the importance of ASR errors is ranked differently

depending on the evaluation metric used. A more detailed analysis

shows that the estimation of the error impact given by the ATENE

metric is more adapted to the NER task than the estimation based

only on the most used frequency metric WER.

P22 - Information Extraction and Retrieval (2), Thursday, May 26, 9:45

Chairperson: Robert Gaizauskas Poster Session

A Study of Reuse and Plagiarism in LREC papers

Gil Francopoulo, Joseph Mariani and Patrick Paroubek

The aim of this experiment is to present an easy way to compare

fragments of texts in order to detect (supposed) results of copy

& paste operations between articles in the domain of Natural

Language Processing (NLP). The search space of the comparisons

is a corpus labeled as NLP4NLP gathering a large part of the NLP

field. The study is centered on LREC papers in both directions,

first with an LREC paper borrowing a fragment of text from the

collection, and secondly in the reverse direction with fragments of

LREC documents borrowed and inserted in the collection.

Developing a Dataset for Evaluating Approaches for Document Expansion with Images

Debasis Ganguly, Iacer Calixto and Gareth Jones

Motivated by the adage that a “picture is worth a thousand words”

it can be reasoned that automatically enriching the textual content

of a document with relevant images can increase the readability

of a document. Moreover, features extracted from the additional

image data inserted into the textual content of a document may, in

principle, also be used by a retrieval engine to better match the

topic of a document with that of a given query. In this paper, we

describe our approach of building a ground truth dataset to enable

further research into automatic addition of relevant images to text

documents. The dataset is comprised of the official ImageCLEF

2010 collection (a collection of images with textual metadata) to

serve as the images available for automatic enrichment of text, a

set of 25 benchmark documents that are to be enriched, which in

this case are children’s short stories, and a set of manually judged

relevant images for each query story obtained by the standard

procedure of depth pooling. We use this benchmark dataset

to evaluate the effectiveness of standard information retrieval

methods as simple baselines for this task. The results indicate that

using the whole story as a weighted query, where the weight of

each query term is its tf-idf value, achieves a precision of 0.1714

within the top 5 retrieved images on average.
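The tf-idf-weighted-query baseline can be sketched as follows with toy data (the real evaluation uses the ImageCLEF 2010 collection and the 25 story queries): the whole story becomes a tf-idf vector and candidate images are ranked by cosine similarity of their textual metadata.

```python
# Minimal sketch of the described tf-idf-weighted-query baseline (toy data, not
# the ImageCLEF collection): the whole story is turned into a tf-idf vector and
# images are ranked by cosine similarity of their textual metadata.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

image_metadata = [                      # hypothetical captions/metadata
    "a brown fox running through the forest",
    "children playing football in the park",
    "an old castle on a hill at sunset",
]
story = "Once upon a time a clever fox lived deep in the forest..."

vectorizer = TfidfVectorizer(stop_words="english")
image_vectors = vectorizer.fit_transform(image_metadata)
story_vector = vectorizer.transform([story])   # each term weighted by tf-idf

scores = cosine_similarity(story_vector, image_vectors)[0]
ranking = sorted(zip(scores, image_metadata), reverse=True)
for score, meta in ranking[:5]:                # top-5 images, as in the evaluation
    print(f"{score:.3f}  {meta}")
```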

More than Word Cooccurrence: Exploring Support and Opposition in International Climate Negotiations with Semantic Parsing

Pablo Ruiz, Clément Plancq and Thierry Poibeau

Text analysis methods widely used in digital humanities often

involve word co-occurrence, e.g. concept co-occurrence

networks. These methods provide a useful corpus overview,

but cannot determine the predicates that relate co-occurring

concepts. Our goal was identifying propositions expressing

the points supported or opposed by participants in international

climate negotiations. Word co-occurrence methods were not

sufficient, and an analysis based on open relation extraction had

limited coverage for nominal predicates. We present a pipeline

which identifies the points that different actors support and

oppose, via a domain model with support/opposition predicates,

and analysis rules that exploit the output of semantic role

labelling, syntactic dependencies and anaphora resolution. Entity

linking and keyphrase extraction are also performed on the

propositions related to each actor. A user interface allows

examining the main concepts in points supported or opposed

by each participant, which participants agree or disagree with

each other, and about which issues. The system is an example

of tools that digital humanities scholars are asking for, to

render rich textual information (beyond word co-occurrence) more

amenable to quantitative treatment. An evaluation of the tool was

satisfactory.

A Sequence Model Approach to Relation Extraction in Portuguese

Sandra Collovini, Gabriel Machado and Renata Vieira

The task of Relation Extraction from texts is one of the main

challenges in the area of Information Extraction, considering

the required linguistic knowledge and the sophistication of the

language processing techniques employed. This task aims

at identifying and classifying semantic relations that occur

between entities recognized in a given text. In this paper,

we evaluated a Conditional Random Fields classifier for the

extraction of any relation descriptor occurring between named

entities (Organisation, Person and Place categories), as well as

pre-defined relation types between these entities in Portuguese

texts.

Evaluation Set for Slovak News Information Retrieval

Daniel Hládek, Ján Staš and Jozef Juhár

This work proposes an information retrieval evaluation set for

the Slovak language. A set of 80 queries written in the natural

language is given together with the set of relevant documents.

The document set contains 3980 newspaper articles sorted into 6

categories. Each document in the result set is manually annotated

for relevancy with its corresponding query. The evaluation set

is mostly compatible with the Cranfield test collection using the

same methodology for queries and annotation of relevancy. In

addition to that it provides annotation for document title, author,

publication date and category that can be used for evaluation of

automatic document clustering and categorization.

Analyzing Time Series Changes of Correlation between Market Share and Concerns on Companies measured through Search Engine Suggests

Takakazu Imada, Yusuke Inoue, Lei Chen, Syunya Doi, Tian Nie, Chen Zhao, Takehito Utsuro and Yasuhide Kawada

This paper proposes how to utilize a search engine in order to

predict market shares. We propose to compare rates of concerns of

those who search for Web pages among several companies which

supply products, given a specific product domain. We measure

concerns of those who search for Web pages through search

engine suggests. Then, we analyze whether rates of concerns

of those who search for Web pages have certain correlation with

actual market share. We show that those statistics have certain

correlations. We finally propose how to predict the market share

of a specific product genre based on the rates of concerns of those

who search for Web pages.

TermITH-Eval: a French Standard-Based Resource for Keyphrase Extraction Evaluation

Adrien Bougouin, Sabine Barreaux, Laurent Romary, Florian Boudin and Beatrice Daille

Keyphrase extraction is the task of finding phrases that represent

the important content of a document. The main aim of keyphrase

extraction is to propose textual units that represent the most

important topics developed in a document. The output keyphrases

of automatic keyphrase extraction methods for test documents

are typically evaluated by comparing them to manually assigned

reference keyphrases. Each output keyphrase is considered

correct if it matches one of the reference keyphrases. However,

the choice of the appropriate textual unit (keyphrase) for a

topic is sometimes subjective and evaluating by exact matching

underestimates the performance. This paper presents a dataset of

evaluation scores assigned to automatically extracted keyphrases

by human evaluators. Along with the reference keyphrases,

the manual evaluations can be used to validate new evaluation

measures. Indeed, an evaluation measure that is highly correlated

to the manual evaluation is appropriate for the evaluation of

automatic keyphrase extraction methods.
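Exact-match evaluation of this kind, which the paper argues underestimates performance, reduces to set intersection over normalised keyphrases, as in the sketch below (the keyphrases are invented examples).

```python
# Minimal sketch of exact-match keyphrase evaluation (the scheme the paper argues
# underestimates performance). Keyphrases below are invented examples.
def exact_match_scores(extracted: list[str], reference: list[str]):
    ext = {k.lower().strip() for k in extracted}
    ref = {k.lower().strip() for k in reference}
    tp = len(ext & ref)
    precision = tp / len(ext) if ext else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

extracted = ["keyphrase extraction", "evaluation measure", "topic model"]
reference = ["automatic keyphrase extraction", "evaluation measure", "French corpus"]
print(exact_match_scores(extracted, reference))   # roughly (0.33, 0.33, 0.33)
```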

The Royal Society Corpus: From Uncharted Data to Corpus

Hannah Kermes, Stefania Degaetano-Ortlieb, Ashraf Khamis, Jörg Knappen and Elke Teich

We present the Royal Society Corpus (RSC) built from the

Philosophical Transactions and Proceedings of the Royal Society

of London. At present, the corpus contains articles from the

first two centuries of the journal (1665–1869) and amounts to

around 35 million tokens. The motivation for building the RSC

is to investigate the diachronic linguistic development of scientific

English. Specifically, we assume that due to specialization,

linguistic encodings become more compact over time (Halliday,

1988; Halliday and Martin, 1993), thus creating a specific

discourse type characterized by high information density that is

functional for expert communication. When building corpora

from uncharted material, typically not all relevant meta-data

(e.g. author, time, genre) or linguistic data (e.g. sentence/word

boundaries, words, parts of speech) is readily available. We

present an approach to obtain good quality meta-data and base

text data adopting the concept of Agile Software Development.

Building Evaluation Datasets for Consumer-Oriented Information Retrieval

Lorraine Goeuriot, Liadh Kelly, Guido Zuccon and Joao Palotti

Common people often experience difficulties in accessing

relevant, correct, accurate and understandable health information

online. Developing search techniques that aid these information

needs is challenging. In this paper we present the datasets created

by CLEF eHealth Lab from 2013-2015 for evaluation of search

solutions to support common people finding health information

online. Specifically, the CLEF eHealth information retrieval

(IR) task of this Lab has provided the research community with

benchmarks for evaluating consumer-centered health information

retrieval, thus fostering research and development aimed to

address this challenging problem. Given consumer queries, the

goal of the task is to retrieve relevant documents from the provided

collection of web pages. The shared datasets provide a large health

web crawl, queries representing people’s real world information

needs, and relevance assessment judgements for the queries.

A Dataset for Open Event Extraction in English

Kiem-Hieu Nguyen, Xavier Tannier, Olivier Ferret and Romaric Besançon

This article presents a corpus for development and testing of event

schema induction systems in English. Schema induction is the

task of learning templates with no supervision from unlabeled

texts, and to group together entities corresponding to the same

role in a template. Most of the previous work on this subject relies

on the MUC-4 corpus. We describe the limits of using this corpus

(size, non-representativeness, similarity of roles across templates)

and propose a new, partially-annotated corpus in English which

remedies some of these shortcomings. We make use of Wikinews

to select the data inside the category Laws & Justice, and query

Google search engine to retrieve different documents on the

same events. Only Wikinews documents are manually annotated

and can be used for evaluation, while the others can be used

for unsupervised learning. We detail the methodology used for

building the corpus and evaluate some existing systems on this

new data.

P23 - Prosody and Phonology, Thursday, May 26, 9:45

Chairperson: Björn Schuller Poster Session

Phoneme Alignment Using the Information on Phonological Processes in Continuous Speech

Daniil Kocharov

The current study focuses on optimization of Levenshtein

algorithm for the purpose of computing the optimal alignment

between two phoneme transcriptions of spoken utterance

containing sequences of phonetic symbols. The alignment is

computed with the help of a confusion matrix in which costs for

phonetic symbol deletion, insertion and substitution are defined

taking into account various phonological processes that occur in

fluent speech, such as anticipatory assimilation, phone elision and

epenthesis. The corpus containing about 30 hours of Russian

read speech was used to evaluate the presented algorithms.

The experimental results have shown significant reduction of

misalignment rate in comparison with the baseline Levenshtein

algorithm: the number of errors has been reduced from 1.1% to

0.28%.
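A cost-weighted Levenshtein alignment of the kind described can be sketched as follows; the substitution costs here are invented placeholders, whereas in the described system the costs are derived from phonological processes such as assimilation, elision and epenthesis.

```python
# Minimal sketch of a cost-weighted Levenshtein alignment between two phoneme
# strings (not the author's implementation). The substitution costs below are
# invented; in the described system they reflect phonological processes.
SUB_COST = {("t", "d"): 0.3, ("a", "o"): 0.4}       # hypothetical reduced costs
INS_COST = DEL_COST = 1.0

def sub_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return SUB_COST.get((a, b), SUB_COST.get((b, a), 1.0))

def weighted_levenshtein(ref: list[str], hyp: list[str]) -> float:
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * DEL_COST
    for j in range(1, m + 1):
        d[0][j] = j * INS_COST
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + DEL_COST,
                          d[i][j - 1] + INS_COST,
                          d[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]))
    return d[n][m]

print(weighted_levenshtein(list("data"), list("dota")))  # cheap a/o substitution
```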

CoRuSS - a New Prosodically Annotated Corpus of Russian Spontaneous Speech

Tatiana Kachkovskaia, Daniil Kocharov, Pavel Skrelin and Nina Volskaya

This paper describes speech data recording, processing and

annotation of a new speech corpus CoRuSS (Corpus of

Russian Spontaneous Speech), which is based on connected

communicative speech recorded from 60 native Russian male and

female speakers of different age groups (from 16 to 77). Some

Russian speech corpora available at the moment contain plain

orthographic texts and provide some kind of limited annotation,

but there are no corpora providing detailed prosodic annotation of

spontaneous conversational speech. This corpus contains 30 hours

of high quality recorded spontaneous Russian speech, half of it has

been transcribed and prosodically labeled. The recordings consist

of dialogues between two speakers, monologues (speakers’ self-

presentations) and reading of a short phonetically balanced text.

Since the corpus is labeled for a wide range of linguistic - phonetic

and prosodic - information, it provides basis for empirical

studies of various spontaneous speech phenomena as well as for

comparison with those we observe in prepared read speech. Since

the corpus is designed as an open-access resource of speech data,

it will also make it possible to advance corpus-based analysis of

spontaneous speech data across languages, as well as speech

technology development.

Defining and Counting Phonological Classes in Cross-linguistic Segment Databases

Dan Dediu and Scott Moisik

Recently, there has been an explosion in the availability of large,

good-quality cross-linguistic databases such as WALS (Dryer &

Haspelmath, 2013), Glottolog (Hammarstrom et al., 2015) and

Phoible (Moran & McCloy, 2014). Databases such as Phoible

contain the actual segments used by various languages as they

are given in the primary language descriptions. However, this

segment-level representation cannot be used directly for analyses

that require generalizations over classes of segments that share

theoretically interesting features. Here we present a method

and the associated R (R Core Team, 2014) code that allows the

flexible definition of such meaningful classes and that can identify

the sets of segments falling into such a class for any language

inventory. The method and its results are important for those

interested in exploring cross-linguistic patterns of phonetic and

phonological diversity and their relationship to extra-linguistic

factors and processes such as climate, economics, history or

human genetics.

Introducing the SEA_AP: an Enhanced Tool for Automatic Prosodic Analysis

Marta Martinez, Rocio Varela, Carmen García Mateo, Elisa Fernandez Rei and Adela Martinez Calvo

SEA_AP (Segmentador e Etiquetador Automático para

Análise Prosódica, Automatic Segmentation and Labelling for

Prosodic Analysis) toolkit is an application that performs audio

segmentation and labelling to create a TextGrid file which will be

used to launch a prosodic analysis using Praat. In this paper, we

want to describe the improved functionality of the tool achieved

by adding a dialectometric analysis module using R scripts. The

dialectometric analysis includes computing correlations among

F0 curves and it obtains prosodic distances among the different

variables of interest (location, speaker, structure, etc.). The

dialectometric analysis requires large databases in order to be

adequately computed, and automatic segmentation and labelling

can create them thanks to a procedure less costly than the manual

alternative. Thus, the integration of these tools into the SEA_

AP allows to propose a distribution of geoprosodic areas by

means of a quantitative method, which completes the traditional

dialectological point of view. The current version of the SEA_

AP toolkit is capable of analysing Galician, Spanish and Brazilian

Portuguese data, and hence the distances between several prosodic

linguistic varieties can be measured at present.

A Machine Learning based Music Retrieval and Recommendation System

Naziba Mostafa, Yan Wan, Unnayan Amitabh and Pascale Fung

In this paper, we present a music retrieval and recommendation

system using machine learning techniques. We propose a query

by humming system for music retrieval that uses deep neural

networks for note transcription and a note-based retrieval system

for retrieving the correct song from the database. We evaluate

our query by humming system using the standard MIREX QBSH

dataset. We also propose a similar artist recommendation system

which recommends similar artists based on acoustic features of

the artists’ music, online text descriptions of the artists and social

media data. We use supervised machine learning techniques over

all our features and compare our recommendation results to those

produced by a popular similar artist recommendation website.

P24 - Speech Processing (1), Thursday, May 26, 9:45

Chairperson: Andrew Caines Poster Session

CirdoX: an on/off-line multisource speech and sound analysis software

Frédéric Aman, Michel Vacher, François Portet, William Duclot and Benjamin Lecouteux

Vocal User Interfaces in domestic environments recently gained

interest in the speech processing community. This interest is

due to the opportunity of using it in the framework of Ambient

Assisted Living both for home automation (vocal command) and

for calls for help in case of distress situations, i.e. after a fall.

CirdoX, which is modular software, is able to analyse online

the audio environment in a home, to extract the uttered sentences

and then to process them thanks to an ASR module. Moreover,

this system performs non-speech audio event classification; in this

case, specific models must be trained. The software is designed to

be modular and to process on-line the audio multichannel stream.

Some examples of studies in which CirdoX was involved are

described. They were operated in a real environment, namely a

Living Lab environment.

Optimizing Computer-Assisted Transcription Quality with Iterative User Interfaces

Matthias Sperber, Graham Neubig, Satoshi Nakamura and Alex Waibel

Computer-assisted transcription promises high-quality speech

transcription at reduced costs. This is achieved by limiting human

effort to transcribing parts for which automatic transcription

quality is insufficient. Our goal is to improve the human

transcription quality via appropriate user interface design. We

focus on iterative interfaces that allow humans to solve tasks

based on an initially given suggestion, in this case an automatic

transcription. We conduct a user study that reveals considerable

quality gains for three variations of iterative interfaces over

a non-iterative from-scratch transcription interface. Our

iterative interfaces included post-editing, confidence-enhanced

post-editing, and a novel retyping interface. All three yielded

similar quality on average, but we found that the proposed

retyping interface was less sensitive to the difficulty of the

segment, and superior when the automatic transcription of the

segment contained relatively many errors. An analysis using

mixed-effects models allows us to quantify these and other factors

and draw conclusions over which interface design should be

chosen in which circumstance.

A Framework for Collecting Realistic Recordings of Dysarthric Speech - the homeService Corpus

Mauro Nicolao, Heidi Christensen, Stuart Cunningham, Phil Green and Thomas Hain

This paper introduces a new British English speech database,

named the homeService corpus, which has been gathered as

part of the homeService project. This project aims to help

users with speech and motor disabilities to operate their home

appliances using voice commands. The audio recorded during

such interactions consists of realistic data of speakers with severe

dysarthria. The majority of the homeService corpus is recorded

in real home environments where voice control is often the

normal means by which users interact with their devices. The

collection of the corpus is motivated by the shortage of realistic

dysarthric speech corpora available to the scientific community.

Along with the details on how the data is organised and how it

can be accessed, a brief description of the framework used to

make the recordings is provided. Finally, the performance of the

homeService automatic recogniser for dysarthric speech trained

with single-speaker data from the corpus is provided as an initial

baseline. Access to the homeService corpus is provided through

the dedicated web page at http://mini.dcs.shef.ac.

uk/resources/homeservice-corpus/. This will also

have the most updated description of the data. At the time of

writing the collection process is still ongoing.

Automatic Anomaly Detection for Dysarthria across Two Speech Styles: Read vs Spontaneous Speech

Imed Laaridh, Corinne Fredouille and Christine Meunier

Perceptive evaluation of speech disorders is still the standard

method in clinical practice for diagnosing patients and following

the progression of their condition. Such methods include

different tasks such as read speech, spontaneous speech, isolated

words, sustained vowels, etc. In this context, automatic

speech processing tools have proven pertinence in speech quality

evaluation and assistive technology-based applications. However,

very few studies have investigated the use of automatic tools

on spontaneous speech. This paper investigates the behavior

of an automatic phone-based anomaly detection system when

applied on read and spontaneous French dysarthric speech. The

behavior of the automatic tool reveals interesting inter-pathology

differences across speech styles.

TTS for Low Resource Languages: A Bangla Synthesizer

Alexander Gutkin, Linne Ha, Martin Jansche, Knot Pipatsrisawat and Richard Sproat

We present a text-to-speech (TTS) system designed for the dialect

of Bengali spoken in Bangladesh. This work is part of an

ongoing effort to address the needs of under-resourced languages.

We propose a process for streamlining the bootstrapping of

TTS systems for under-resourced languages. First, we use

crowdsourcing to collect the data from multiple ordinary speakers,

each speaker recording a small number of sentences. Second,

we leverage an existing text normalization system for a related

language (Hindi) to bootstrap a linguistic front-end for Bangla.

Third, we employ statistical techniques to construct multi-speaker

acoustic models using Long Short-Term Memory Recurrent

Neural Network (LSTM-RNN) and Hidden Markov Model

(HMM) approaches. We then describe our experiments that show

that the resulting TTS voices score well in terms of their perceived

quality as measured by Mean Opinion Score (MOS) evaluations.

Speech Trax: A Bottom to the Top Approach for Speaker Tracking and Indexing in an Archiving Context

Félicien Vallet, Jim Uro, Jérémy Andriamakaoly, Hakim Nabi, Mathieu Derval and Jean Carrive

With the increasing amount of audiovisual and digital data

deriving from televisual and radiophonic sources, professional

archives such as INA, France’s national audiovisual institute,

acknowledge a growing need for efficient indexing tools. In

this paper, we describe the Speech Trax system that aims at

analyzing the audio content of TV and radio documents. In

particular, we focus on the speaker tracking task that is very

valuable for indexing purposes. First, we detail the overall

architecture of the system and show the results obtained on a

large-scale experiment, the largest to our knowledge for this type

of content (about 1,300 speakers). Then, we present the Speech

Trax demonstrator that gathers the results of various automatic

speech processing techniques on top of our speaker tracking

system (speaker diarization, speech transcription, etc.). Finally,

we provide insight on the obtained performances and suggest hints

for future improvements.

O21 - Social Media, Thursday, May 26, 11:45

Chairperson: Piek Vossen Oral Session

Web Chat Conversations from Contact Centers: a Descriptive Study

Geraldine Damnati, Aleksandra Guerraz and Delphine Charlet

In this article we propose a descriptive study of a chat

conversations corpus from an assistance contact center.

Conversations are described from several view points, including

interaction analysis, language deviation analysis and typographic

expressivity marks analysis. We provide in particular a detailed

analysis of language deviations that are encountered in our corpus

of 230 conversations, corresponding to 6879 messages and 76839

words. These deviations may be challenging for further syntactic

and semantic parsing. Analysis is performed with a distinction

between Customer messages and Agent messages. On the

overall only 4% of the observed words are misspelled but 26%

of the messages contain at least one erroneous word (rising to

40% when focused on Customer messages). Transcriptions of

telephone conversations from an assistance call center are also

studied, allowing comparisons between these two interaction

modes to be drawn. The study reveals significant differences

in terms of conversation flow, with an increased efficiency for

chat conversations in spite of a longer temporal span.

Identification of Drug-Related Medical Conditions in Social Media

François Morlane-Hondère, Cyril Grouin and Pierre Zweigenbaum

Monitoring social media has been shown to be an interesting

approach for the early detection of drug adverse effects. In

this paper, we describe a system which extracts medical entities

in French drug reviews written by users. We focus on the

identification of medical conditions, which is based on the concept

of post-coordination: we first extract minimal medical-related

entities (pain, stomach) then we combine them to identify complex

ones (It was the worst [pain I ever felt in my stomach]). These

two steps are respectively performed by two classifiers, the first

being based on Conditional Random Fields and the second one

on Support Vector Machines. The overall results of the minimal

entity classifier are the following: P=0.926; R=0.849; F1=0.886.

A thorough analysis of the feature set shows that, when

combined with word lemmas, clusters generated by word2vec are

the most valuable features. When trained on the output of the first

classifier, the second classifier’s performances are the following:

P=0.683; R=0.956; F1=0.797. The addition of post-processing rules

did not add any significant global improvement but was found to

modify the precision/recall ratio.

Creating a Lexicon of Bavarian Dialect by Means of Facebook Language Data and Crowdsourcing

Manuel Burghardt, Daniel Granvogl and Christian Wolff

Data acquisition in dialectology is typically a tedious task, as

dialect samples of spoken language have to be collected via

questionnaires or interviews. In this article, we suggest to use

the “web as a corpus” approach for dialectology. We present a

case study that demonstrates how authentic language data for the

Bavarian dialect (ISO 639-3:bar) can be collected automatically

from the social network Facebook. We also show that Facebook

can be used effectively as a crowdsourcing platform, where users

are willing to translate dialect words collaboratively in order to

create a common lexicon of their Bavarian dialect. Key insights

from the case study are summarized as “lessons learned” together

with suggestions for future enhancements of the lexicon creation

approach.

A Corpus of Wikipedia Discussions: Over the Years, with Topic, Power and Gender Labels

Vinodkumar Prabhakaran and Owen Rambow

In order to gain a deep understanding of how social context

manifests in interactions, we need data that represents interactions


from a large community of people over a long period of time,

capturing different aspects of social context. In this paper, we

present a large corpus of Wikipedia Talk page discussions that

are collected from a broad range of topics, containing discussions

that happened over a period of 15 years. The dataset contains

166,322 discussion threads, across 1236 articles/topics that span

15 different topic categories or domains. The dataset also captures

whether the post is made by a registered user or not, and

whether he/she was an administrator at the time of making the

post. It also captures the Wikipedia age of editors in terms of

number of months spent as an editor, as well as their gender.

This corpus will be a valuable resource to investigate a variety

of computational sociolinguistics research questions regarding

online social interactions.

O22 - Anaphora and Coreference, Thursday, May 26, 11:45

Chairperson: Eva Hajicová Oral Session

Phrase Detectives Corpus 1.0: Crowdsourced Anaphoric Coreference

Jon Chamberlain, Massimo Poesio and Udo Kruschwitz

Natural Language Engineering tasks require large and complex

annotated datasets to build more advanced models of language.

Corpora are typically annotated by several experts to create

a gold standard; however, there are now compelling reasons

to use a non-expert crowd to annotate text, driven by cost,

speed and scalability. Phrase Detectives Corpus 1.0 is an

anaphorically-annotated corpus of encyclopedic and narrative text

that contains a gold standard created by multiple experts, as well

as a set of annotations created by a large non-expert crowd.

Analysis shows very good inter-expert agreement (kappa=.88-.93)

but a more variable baseline crowd agreement (kappa=.52-.96).

Encyclopedic texts show less agreement (and by implication are

harder to annotate) than narrative texts. The release of this corpus

is intended to encourage research into the use of crowds for text

annotation and the development of more advanced, probabilistic

language models, in particular for anaphoric coreference.

Summ-it++: an Enriched Version of the Summ-it Corpus

Evandro Fonseca, André Antonitsch, Sandra Collovini, Daniela Amaral, Renata Vieira and Anny Figueira

This paper presents Summ-it++, an enriched version of the Summ-

it corpus. In this new version, the corpus has received new

semantic layers, named entity categories and relations between

named entities, adding to the previous coreference annotation. In

addition, we change the original Summ-it format to SemEval.

Towards Multiple Antecedent Coreference Resolution in Specialized Discourse

Alicia Burga, Sergio Cajal, Joan Codina-Filba and Leo Wanner

Despite the popularity of coreference resolution as a research

topic, the overwhelming majority of the work in this area focused

so far on single antecedence coreference only. Multiple antecedent

coreference (MAC) has been largely neglected. This can be

explained by the scarcity of the phenomenon of MAC in generic

discourse. However, in specialized discourse such as patents,

MAC is very dominant. It seems thus unavoidable to address

the problem of MAC resolution in the context of tasks related

to automatic patent material processing, among them abstractive

summarization, deep parsing of patents, construction of concept

maps of the inventions, etc. We present the first version of

an operational rule-based MAC resolution strategy for patent

material that covers the three major types of MAC: (i) nominal

MAC, (ii) MAC with personal / relative pronouns, and (iii) MAC with

reflexive / reciprocal pronouns. The evaluation shows that our

strategy performs well in terms of precision and recall.

ARRAU: Linguistically-Motivated Annotation of Anaphoric Descriptions

Olga Uryupina, Ron Artstein, Antonella Bristot, Federica Cavicchio, Kepa Rodriguez and Massimo Poesio

This paper presents a second release of the ARRAU dataset:

a multi-domain corpus with thorough linguistically motivated

annotation of anaphora and related phenomena. Building upon

the first release almost a decade ago, a considerable effort had

been invested in improving the data both quantitatively and

qualitatively. Thus, we have doubled the corpus size, expanded

the selection of covered phenomena to include referentiality and

genericity and designed and implemented a methodology for

enforcing the consistency of the manual annotation. We believe

that the new release of ARRAU provides valuable material for

ongoing research in complex cases of coreference as well as for a

variety of related tasks. The corpus is publicly available through

LDC.


O23 - Machine Learning and Information Extraction, Thursday, May 26, 11:45

Chairperson: Feiyu Xu Oral Session

An Annotated Corpus and Method for Analysis of Ad-Hoc Structures Embedded in Text

Eric Yeh, John Niekrasz, Dayne Freitag and Richard Rohwer

We describe a method for identifying and performing functional

analysis of structured regions that are embedded in natural

language documents, such as tables or key-value lists. Such

regions often encode information according to ad hoc schemas

and avail themselves of visual cues in place of natural language

grammar, presenting problems for standard information extraction

algorithms. Unlike previous work in table extraction, which

assumes a relatively noiseless two-dimensional layout, our

aim is to accommodate a wide variety of naturally occurring

structure types. Our approach has three main parts. First, we

collect and annotate a diverse sample of “naturally” occurring

structures from several sources. Second, we use probabilistic text

segmentation techniques, featurized by skip bigrams over spatial

and token category cues, to automatically identify contiguous

regions of structured text that share a common schema. Finally,

we identify the records and fields within each structured region

using a combination of distributional similarity and sequence

alignment methods, guided by minimal supervision in the form

of a single annotated record. We evaluate the last two components

individually, and conclude with a discussion of further work.
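
One plausible reading of the skip-bigram features over spatial and token-category cues mentioned above is sketched below (the category inventory and the skip window are assumptions for illustration):

    # Illustrative featurization: skip bigrams over coarse token categories
    # within a line, usable as features for a text segmentation model.
    import itertools
    import re

    def token_category(tok):
        if tok.isdigit():
            return "NUM"
        if re.fullmatch(r"[^\w\s]+", tok):
            return "PUNCT"
        return "WORD" if tok.isalpha() else "MIXED"

    def skip_bigrams(line, max_skip=2):
        cats = [token_category(t) for t in line.split()]
        feats = set()
        for i, j in itertools.combinations(range(len(cats)), 2):
            if j - i <= max_skip + 1:
                feats.add((cats[i], cats[j], j - i))
        return feats

    print(skip_bigrams("Key : value 42"))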

Learning Thesaurus Relations from Distributional Features

Rosa Tsegaye Aga, Christian Wartena, Lucas Drumond and Lars Schmidt-Thieme

In distributional semantics words are represented by aggregated

context features. The similarity of words can be computed by

comparing their feature vectors. Thus, we can predict whether

two words are synonymous or similar with respect to some

other semantic relation. We will show on six different datasets

of pairs of similar and non-similar words that a supervised

learning algorithm on feature vectors representing pairs of words

outperforms cosine similarity between vectors representing single

words. We compared different methods to construct a feature

vector representing a pair of words. We show that simple methods

like pairwise addition or multiplication give better results than

a recently proposed method that combines different types of

features. The semantic relation we consider is relatedness of

terms in thesauri for intellectual document classification. Thus our

findings can directly be applied for the maintenance and extension

of such thesauri. To the best of our knowledge this relation was

not considered before in the field of distributional semantics.
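
The comparison described above can be sketched as follows: a word pair is represented by element-wise addition or multiplication of its two vectors and fed to a supervised classifier, while the baseline ranks pairs by cosine similarity (the embeddings, the toy supervision and the choice of logistic regression are illustrative assumptions):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    emb = {w: rng.normal(size=50) for w in ["car", "automobile", "vehicle", "banana"]}

    def pair_vec(w1, w2, mode="add"):
        v1, v2 = emb[w1], emb[w2]
        return v1 + v2 if mode == "add" else v1 * v2

    def cosine(w1, w2):
        v1, v2 = emb[w1], emb[w2]
        return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

    # Toy supervision: 1 = related in the thesaurus, 0 = unrelated.
    pairs = [("car", "automobile", 1), ("car", "banana", 0),
             ("vehicle", "automobile", 1), ("banana", "vehicle", 0)]
    X = np.stack([pair_vec(a, b) for a, b, _ in pairs])
    y = [lab for _, _, lab in pairs]
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    print(clf.predict(pair_vec("car", "automobile").reshape(1, -1)))
    print(cosine("car", "banana"))   # cosine baseline for comparison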

Factuality Annotation and Learning in Spanish Texts

Dina Wonsever, Aiala Rosá and Marisa Malcuori

We present a proposal for the annotation of factuality of event

mentions in Spanish texts and a freely available annotated corpus.

Our factuality model aims to capture a pragmatic notion of

factuality, trying to reflect a casual reader’s judgements about

the realis / irrealis status of mentioned events. Also, some

learning experiments (SVM and CRF) have been held, showing

encouraging results.

NNBlocks: A Deep Learning Framework for Computational Linguistics Neural Network Models

Frederico Tommasi Caroli, André Freitas, João Carlos Pereira da Silva and Siegfried Handschuh

Lately, with the success of Deep Learning techniques in some

computational linguistics tasks, many researchers want to explore

new models for their linguistics applications. These models tend

to be very different from what standard Neural Networks look

like, limiting the possibility to use standard Neural Networks

frameworks. This work presents NNBlocks, a new framework

written in Python to build and train Neural Networks that are not

constrained by a specific kind of architecture, making it possible

to use it in computational linguistics.

O24 - Speech Corpus for Health, Thursday, May 26, 11:45

Chairperson: Eleni Efthimiou Oral Session

Automatic identification of Mild Cognitive Impairment through the analysis of Italian spontaneous speech productions

Daniela Beltrami, Laura Calzà, Gloria Gagliardi, Enrico Ghidoni, Norina Marcello, Rema Rossini Favretti and Fabio Tamburini

This paper presents some preliminary results of the OPLON

project. It aimed at identifying early linguistic symptoms of

cognitive decline in the elderly. This pilot study was conducted

on a corpus composed of spontaneous speech samples collected

from 39 subjects, who underwent a neuropsychological screening

for visuo-spatial abilities, memory, language, executive functions

and attention. A rich set of linguistic features was extracted from

the digitalised utterances (at phonetic, suprasegmental, lexical,

morphological and syntactic levels) and the statistical significance


in pinpointing the pathological process was measured. Our results

show remarkable trends for what concerns both the linguistic traits

selection and the automatic classifiers building.

On the Use of a Serious Game for Recording a Speech Corpus of People with Intellectual Disabilities

Mario Corrales-Astorgano, David Escudero-Mancebo, Yurena Gutiérrez-González, Valle Flores-Lucas, César González-Ferreras and Valentín Cardeñoso-Payo

This paper describes the recording of a speech corpus focused

on prosody of people with intellectual disabilities. To do this,

a video game is used with the aim of improving the user’s

motivation. Moreover, the player’s profiles and the sentences

recorded during the game sessions are described. With the

purpose of identifying the main prosodic troubles of people with

intellectual disabilities, some prosodic features are extracted from

recordings, like fundamental frequency, energy and pauses. After

that, a comparison is made between the recordings of people with

intellectual disabilities and people without intellectual disabilities.

This comparison shows that pauses are the best discriminative

feature between these groups. To check this, a study has been

done using machine learning techniques, with a classification rate

superior to 80%.

Building Language Resources for Exploring Autism Spectrum Disorders

Julia Parish-Morris, Christopher Cieri, Mark Liberman, Leila Bateman, Emily Ferguson and Robert T. Schultz

Autism spectrum disorder (ASD) is a complex

neurodevelopmental condition that would benefit from low-cost

and reliable improvements to screening and diagnosis. Human

language technologies (HLTs) provide one possible route to

automating a series of subjective decisions that currently inform

“Gold Standard” diagnosis based on clinical judgment. In this

paper, we describe a new resource to support this goal, comprised

of 100 20-minute semi-structured English language samples

labeled with child age, sex, IQ, autism symptom severity, and

diagnostic classification. We assess the feasibility of digitizing

and processing sensitive clinical samples for data sharing, and

identify areas of difficulty. Using the methods described here, we

propose to join forces with researchers and clinicians throughout

the world to establish an international repository of annotated

language samples from individuals with ASD and related

disorders. This project has the potential to improve the lives of

individuals with ASD and their families by identifying linguistic

features that could improve remote screening, inform personalized

intervention, and promote advancements in clinically-oriented

HLTs.

Vocal Pathologies Detection and Mispronounced Phonemes Identification: Case of Arabic Continuous Speech

Naim Terbeh and Mounir Zrigui

We propose in this work a novel acoustic phonetic study for

Arabic people suffering from language disabilities and non-

native learners of Arabic language to classify Arabic continuous

speech to pathological or healthy and to identify phonemes that

pose pronunciation problems (case of pathological speeches).

The main idea can be summarized as comparing the reference

phonetic model of spoken Arabic with the model proper to the

concerned speaker. For this task, we use techniques of

automatic speech processing like forced alignment and artificial

neural network (ANN) (Basheer, 2000). Based on a test corpus

containing 100 speech sequences, recorded by different speakers

(healthy/pathological speeches and native/foreign speakers), we

attain 97% as classification rate. Algorithms used in identifying

phonemes that pose pronunciation problems show high efficiency:

we attain an identification rate of 100%.

P25 - Crowdsourcing, Thursday, May 26, 11:45

Chairperson: Monica Monachini Poster Session

Wikipedia Titles As Noun Tag Predictors

Armin Hoenen

In this paper, we investigate a covert labeling cue, namely the

probability that a title (by example of the Wikipedia titles) is

a noun. If this probability is very large, any list such as or

comparable to the Wikipedia titles can be used as a reliable

word-class (or part-of-speech tag) predictor or noun lexicon.

This may be especially useful in the case of Low Resource

Languages (LRL) where labeled data is lacking and putatively for

Natural Language Processing (NLP) tasks such as Word Sense

Disambiguation, Sentiment Analysis and Machine Translation.

Profiting from the ease of digital publication on the web as

opposed to print, LRL speaker communities produce resources

such as Wikipedia and Wiktionary, which can be used for an

assessment. We provide statistical evidence for a strong noun

bias for the Wikipedia titles from 2 corpora (English, Persian)

and a dictionary (Japanese) and for a typologically balanced set of

17 languages including LRLs. Additionally, we conduct a small


experiment on predicting noun tags for out-of-vocabulary items in

part-of-speech tagging for English.
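
A minimal sketch of the noun-bias estimate described above, assuming POS-tagged titles as input and treating the last token as the head of a title (both simplifications are illustrative assumptions, not the paper's procedure):

    # Estimate how often a title list behaves like a noun lexicon.
    def noun_bias(tagged_titles):
        # tagged_titles: list of titles, each a list of (token, POS) pairs
        nouny = sum(1 for title in tagged_titles
                    if title and title[-1][1].startswith("NN"))
        return nouny / len(tagged_titles)

    titles = [[("Language", "NN"), ("resource", "NN")],
              [("Portorož", "NNP")],
              [("Run", "VB")]]
    print(noun_bias(titles))   # 2/3 of these toy titles are head-nominal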

Japanese Word–Color Associations with and without Contexts

Jun Harashima

Although some words carry strong associations with specific

colors (e.g., the word danger is associated with the color red), few

studies have investigated these relationships. This may be due

to the relative rarity of databases that contain large quantities of

such information. Additionally, these resources are often limited

to particular languages, such as English. Moreover, the existing

resources often do not consider the possible contexts of words

in assessing the associations between a word and a color. As a

result, the influence of context on word–color associations is not

fully understood. In this study, we constructed a novel language

resource for word–color associations. The resource has two

characteristics: First, our resource is the first to include Japanese

word–color associations, which were collected via crowdsourcing.

Second, the word–color associations in the resource are linked

to contexts. We show that word–color associations depend on

language and that associations with certain colors are affected by

context information.

The VU Sound Corpus: Adding More Fine-grained Annotations to the Freesound Database

Emiel van Miltenburg, Benjamin Timmermans and Lora Aroyo

This paper presents a collection of annotations (tags or keywords)

for a set of 2,133 environmental sounds taken from the Freesound

database (www.freesound.org). The annotations are acquired

through an open-ended crowd-labeling task, in which participants

were asked to provide keywords for each of three sounds. The

main goal of this study is to find out (i) whether it is feasible

to collect keywords for a large collection of sounds through

crowdsourcing, and (ii) how people talk about sounds, and what

information they can infer from hearing a sound in isolation. Our

main finding is that it is not only feasible to perform crowd-

labeling for a large collection of sounds, it is also very useful to

highlight different aspects of the sounds that authors may fail to

mention. Our data is freely available, and can be used to ground

semantic models, improve search in audio databases, and to study

the language of sound.

Crowdsourcing a Large Dataset of Domain-Specific Context-Sensitive Semantic Verb Relations

Maria Sukhareva, Judith Eckle-Kohler, Ivan Habernal and Iryna Gurevych

We present a new large dataset of 12403 context-sensitive verb

relations manually annotated via crowdsourcing. These relations

capture fine-grained semantic information between verb-centric

propositions, such as temporal or entailment relations. We

propose a novel semantic verb relation scheme and design a

multi-step annotation approach for scaling-up the annotations

using crowdsourcing. We employ several quality measures and

report on agreement scores. The resulting dataset is available

under a permissive Creative Commons license at www.ukp.tu-

darmstadt.de/data/verb-relations/. It represents a valuable

resource for various applications, such as automatic information

consolidation or automatic summarization.

Acquiring Opposition Relations among Italian Verb Senses using Crowdsourcing

Anna Feltracco, Simone Magnolini, Elisabetta Jezek and Bernardo Magnini

We describe an experiment for the acquisition of opposition

relations among Italian verb senses, based on a crowdsourcing

methodology. The goal of the experiment is to discuss whether

the types of opposition we distinguish (i.e. complementarity,

antonymy, converseness and reversiveness) are actually perceived

by the crowd. In particular, we collect data for Italian by using

the crowdsourcing platform CrowdFlower. We ask annotators to

judge the type of opposition existing among pairs of sentences

(previously judged as opposite) that differ only in a verb: the verb

in the first sentence is the opposite of the verb in the second sentence.

Data corroborate the hypothesis that some opposition relations

exclude each other, while others interact, being recognized as

compatible by the contributors.

Crowdsourcing a Multi-lingual Speech Corpus: Recording, Transcription and Annotation of the CrowdIS Corpora

Andrew Caines, Christian Bentz, Calbert Graham, Tim Polzehl and Paula Buttery

We announce the release of the CROWDED CORPUS: a pair

of speech corpora collected via crowdsourcing, containing a

native speaker corpus of English (CROWDED_ENGLISH),

and a corpus of German/English bilinguals (CROWDED_


BILINGUAL). Release 1 of the CROWDED CORPUS contains

1000 recordings amounting to 33,400 tokens collected from

80 speakers and is freely available to other researchers. We

recruited participants via the Crowdee application for Android.

Recruits were prompted to respond to business-topic questions

of the type found in language learning oral tests. We then

used the CrowdFlower web application to pass these recordings

to crowdworkers for transcription and annotation of errors and

sentence boundaries. Finally, the sentences were tagged and

parsed using standard natural language processing tools. We

propose that crowdsourcing is a valid and economical method for

corpus collection, and discuss the advantages and disadvantages

of this approach.

The REAL Corpus: A Crowd-Sourced Corpus of Human Generated and Evaluated Spatial References to Real-World Urban Scenes

Phil Bartie, William Mackaness, Dimitra Gkatzia and Verena Rieser

Our interest is in people’s capacity to efficiently and effectively

describe geographic objects in urban scenes. The broader

ambition is to develop spatial models capable of equivalent

functionality able to construct such referring expressions. To

that end we present a newly crowd-sourced data set of natural

language references to objects anchored in complex urban scenes

(In short: The REAL Corpus – Referring Expressions Anchored

Language). The REAL corpus contains a collection of images

of real-world urban scenes together with verbal descriptions of

target objects generated by humans, paired with data on how

successfully other people were able to identify the same object based

on these descriptions. In total, the corpus contains 32 images

with on average 27 descriptions per image and 3 verifications

for each description. In addition, the corpus is annotated with a

variety of linguistically motivated features. The paper highlights

issues posed by collecting data using crowd-sourcing with an

unrestricted input format, as well as using real-world urban

scenes.

Introducing the Weighted Trustability Evaluator for Crowdsourcing Exemplified by Speaker Likability Classification

Simone Hantke, Erik Marchi and Björn Schuller

Crowdsourcing is an emerging collaborative approach applicable

among many other applications to the area of language and

speech processing. In fact, the use of crowdsourcing was

already applied in the field of speech processing with promising

results. However, only few studies investigated the use

of crowdsourcing in computational paralinguistics. In this

contribution, we propose a novel evaluator for crowdsourcing-

based ratings termed Weighted Trustability Evaluator (WTE)

which is computed from the rater-dependent consistency over

the test questions. We further investigate the reliability of

crowdsourced annotations as compared to the ones obtained with

traditional labelling procedures, such as constrained listening

experiments in laboratories or in controlled environments.

This comparison includes an in-depth analysis of obtainable

classification performances. The experiments were conducted

on the Speaker Likability Database (SLD) already used in the

INTERSPEECH Challenge 2012, and the results lend further

weight to the assumption that crowdsourcing can be applied as a

reliable annotation source for computational paralinguistics given

a sufficient number of raters and suited measurements of their

reliability.
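
The exact WTE formula is not given in the abstract; the sketch below only illustrates the general idea of weighting raters by their consistency on gold test questions and then aggregating labels with those weights (data structures and the weighting scheme are assumptions):

    # Hedged sketch of a trustability-weighted label aggregation.
    from collections import defaultdict

    def rater_weights(test_answers, gold):
        # test_answers: {rater: {item: label}}, gold: {item: label}
        weights = {}
        for rater, answers in test_answers.items():
            scored = [answers[i] == gold[i] for i in gold if i in answers]
            weights[rater] = sum(scored) / len(scored) if scored else 0.0
        return weights

    def weighted_vote(item_labels, weights):
        # item_labels: {rater: label} for one item to be annotated
        totals = defaultdict(float)
        for rater, label in item_labels.items():
            totals[label] += weights.get(rater, 0.0)
        return max(totals, key=totals.get)

    w = rater_weights({"r1": {"q1": "A", "q2": "B"}, "r2": {"q1": "A", "q2": "A"}},
                      {"q1": "A", "q2": "B"})
    print(weighted_vote({"r1": "likable", "r2": "not_likable"}, w))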

P26 - Emotion Recognition/Generation, Thursday, May 26, 11:45

Chairperson: Saif Mohammad Poster Session

Comparison of Emotional Understanding in Modality-Controlled Environments using Multimodal Online Emotional Communication Corpus

Yoshiko Arimoto and Kazuo Okanoya

In online computer-mediated communication, speakers were

considered to have experienced difficulties in catching their

partner’s emotions and in conveying their own emotions. To

explain why online emotional communication is so difficult and to

investigate how this problem should be solved, a multimodal online

emotional communication corpus was constructed by recording

approximately 100 speakers’ emotional expressions and reactions

in a modality-controlled environment. Speakers communicated

over the Internet using video chat, voice chat or text chat; their

face-to-face conversations were used for comparison purposes.

The corpora incorporated emotional labels by evaluating the

speaker’s dynamic emotional states and the measurements of

the speaker’s facial expression, vocal expression and autonomic

nervous system activity. For the initial study of this project,

which used a large-scale emotional communication corpus, the

accuracy of online emotional understanding was assessed to

demonstrate the emotional labels evaluated by the speakers

and to summarize the speaker’s answers on the questionnaire

regarding the difference between an online chat and face-to-

face conversations in which they actually participated. The

results revealed that speakers have difficulty communicating their

emotions in online communication environments, regardless of

the type of communication modality and that inaccurate emotional


understanding occurs more frequently in online computer-

mediated communication than in face-to-face communication.

Laughter in French Spontaneous Conversational Dialogs

Brigitte Bigi and Roxane Bertrand

This paper presents a quantitative description of laughter in eight

1-hour French spontaneous conversations. The paper includes the

raw figures for laughter as well as more details concerning inter-

individual variability. It firstly describes to what extent the amount

of laughter and their durations varies from speaker to speaker in

all dialogs. In a second suite of analyses, this paper compares

our corpus with previous analyzed corpora. In a final set of

experiments, it presents some facts about overlapping laughs. This

paper quantifies all of these effects in free-style conversations,

for the first time.

AVAB-DBS: an Audio-Visual Affect Bursts Database for Synthesis

Kevin El Haddad, Huseyin Cakmak, Stéphane Dupont and Thierry Dutoit

It has been shown that adding expressivity and emotional

expressions to an agent’s communication systems would improve

the interaction quality between this agent and a human user. In

this paper we present a multimodal database of affect bursts,

which are very short non-verbal expressions with facial, vocal, and

gestural components that are highly synchronized and triggered by

an identifiable event. This database contains motion capture and

audio data of affect bursts representing disgust, startle and surprise

recorded at three different levels of arousal each. This database is

to be used for synthesis purposes in order to generate affect bursts

of these emotions on a continuous arousal level scale.

Construction of Japanese Audio-Visual Emotion Database and Its Application in Emotion Recognition

Nurul Lubis, Randy Gomez, Sakriani Sakti, Keisuke Nakamura, Koichiro Yoshino, Satoshi Nakamura and Kazuhiro Nakadai

Emotional aspects play a vital role in making human

communication a rich and dynamic experience. As we introduce

more automated systems into our daily lives, it becomes increasingly

important to incorporate emotion to provide as natural an

interaction as possible. To achieve said incorporation, rich sets

of labeled emotional data are a prerequisite. However, in Japanese,

existing emotion databases are still limited to unimodal and bimodal

corpora. Since emotion is not only expressed through speech,

but also visually at the same time, it is essential to include

multiple modalities in an observation. In this paper, we present

the first audio-visual emotion corpus in Japanese, collected from

14 native speakers. The corpus contains 100 minutes of annotated

and transcribed material. We performed preliminary emotion

recognition experiments on the corpus and achieved an accuracy

of 61.42% for five classes of emotion.

Evaluating Context Selection Strategies to Build Emotive Vector Space Models

Lucia C. Passaro and Alessandro Lenci

In this paper we compare different context selection approaches

to improve the creation of Emotive Vector Space Models (VSMs).

The system is based on the results of an existing approach that

showed the possibility to create and update VSMs by exploiting

crowdsourcing and human annotation. Here, we introduce a

method to manipulate the contexts of the VSMs under the

assumption that the emotive connotation of a target word is a

function of both its syntagmatic and paradigmatic association

with the various emotions. To study the differences among the

proposed spaces and to confirm the reliability of the system, we

report on two experiments: in the first one we validated the

best candidates extracted from each model, and in the second

one we compared the models’ performance on a random sample

of target words. Both experiments have been implemented as

crowdsourcing tasks.

P27 - Machine Translation (2), Thursday, May 26, 11:45

Chairperson: Aljoscha Burchardt Poster Session

Simultaneous Sentence Boundary Detection and Alignment with Pivot-based Machine Translation Generated Lexicons

Antoine Bourlon, Chenhui Chu, Toshiaki Nakazawa and Sadao Kurohashi

Sentence alignment is a task that consists in aligning the parallel

sentences in a translated article pair. This paper describes a

method to perform sentence boundary detection and alignment

simultaneously, which significantly improves the alignment

accuracy on languages like Chinese with uncertain sentence

boundaries. It relies on the definition of hard (certain) and

soft (uncertain) punctuation delimiters, the latter being possibly

ignored to optimize the alignment result. The alignment method

is used in combination with lexicons automatically generated

from the input article pairs using pivot-based MT, achieving

better coverage of the input words with fewer entries than pre-

existing dictionaries. Pivot-based MT makes it possible to build

dictionaries for language pairs that have scarce parallel data. The


alignment method is implemented in a tool that will be freely

available in the near future.
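
The hard/soft delimiter idea can be illustrated as follows: hard delimiters always close a sentence, soft ones may be kept or ignored, yielding candidate segmentations among which an aligner can pick the best-scoring one (the delimiter sets and the enumeration strategy are assumptions for illustration):

    from itertools import product

    HARD, SOFT = set("。！？"), set("；：，")

    def candidate_segmentations(text):
        # Enumerate all ways of keeping or ignoring each soft delimiter.
        soft_positions = [i for i, ch in enumerate(text) if ch in SOFT]
        for keep in product([True, False], repeat=len(soft_positions)):
            kept = {p for p, k in zip(soft_positions, keep) if k}
            sents, start = [], 0
            for i, ch in enumerate(text):
                if ch in HARD or i in kept:
                    sents.append(text[start:i + 1])
                    start = i + 1
            if start < len(text):
                sents.append(text[start:])
            yield sents

    for seg in candidate_segmentations("甲；乙。丙"):
        print(seg)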

That’ll Do Fine!: A Coarse Lexical Resource for English-Hindi MT, Using Polylingual Topic Models

Diptesh Kanojia, Aditya Joshi, Pushpak Bhattacharyya and Mark James Carman

Parallel corpora are often injected with bilingual lexical resources

for improved Indian language machine translation (MT). In

absence of such lexical resources, multilingual topic models

have been used to create coarse lexical resources in the past,

using a Cartesian product approach. Our results show that

for morphologically rich languages like Hindi, the Cartesian

product approach is detrimental for MT. We then present a novel

‘sentential’ approach to use this coarse lexical resource from

a multilingual topic model. Our coarse lexical resource when

injected with a parallel corpus outperforms a system trained

using parallel corpus and a good quality lexical resource. As

demonstrated by the quality of our coarse lexical resource and

its benefit to MT, we believe that our sentential approach to

create such a resource will help MT for resource-constrained

languages.

ASPEC: Asian Scientific Paper Excerpt Corpus

Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi and Hitoshi Isahara

In this paper, we describe the details of the ASPEC (Asian

Scientific Paper Excerpt Corpus), which is the first large-

size parallel corpus in the scientific paper domain. ASPEC

was constructed in the Japanese-Chinese machine translation

project conducted between 2006 and 2010 using the Special

Coordination Funds for Promoting Science and Technology.

It consists of a Japanese-English scientific paper abstract

corpus of approximately 3 million parallel sentences (ASPEC-

JE) and a Chinese-Japanese scientific paper excerpt corpus

of approximately 0.68 million parallel sentences (ASPEC-JC).

ASPEC is used as the official dataset for the machine translation

evaluation workshop WAT (Workshop on Asian Translation).

Domain Adaptation in MT Using Titles in Wikipedia as a Parallel Corpus: Resources and Evaluation

Gorka Labaka, Iñaki Alegria and Kepa Sarasola

This paper presents how a state-of-the-art SMT system is

enriched by using extra in-domain parallel corpora extracted

from Wikipedia. We collect corpora from parallel titles and from

parallel fragments in comparable articles from Wikipedia. We

carried out an evaluation with a double objective: evaluating

the quality of the extracted data and evaluating the improvement

due to the domain-adaptation. We think this can be very useful

for languages with limited amount of parallel corpora, where in-

domain data is crucial to improve the performance of MT systems.

The experiments on the Spanish-English language pair improve a

baseline trained with the Europarl corpus by more than 2 points of

BLEU when translating in the Computer Science domain.

ProphetMT: A Tree-based SMT-driven Controlled Language Authoring/Post-Editing Tool

Xiaofeng Wu, Jinhua Du, Qun Liu and Andy Way

This paper presents ProphetMT, a tree-based SMT-driven

Controlled Language (CL) authoring and post-editing tool.

ProphetMT employs the source-side rules in a translation model

and provides them as auto-suggestions to users. Accordingly, one

might say that users are writing in a Controlled Language that

is understood by the computer. ProphetMT also allows users

to easily attach structural information as they compose content.

When a specific rule is selected, a partial translation is promptly

generated on-the-fly with the help of the structural information.

Our experiments conducted on English-to-Chinese show that

our proposed ProphetMT system can not only better regularise

an author’s writing behaviour, but also significantly improve

translation fluency which is vital to reduce the post-editing time.

Additionally, when the writing and translation process is over,

ProphetMT can provide an effective colour scheme to further

improve the productivity of post-editors by explicitly featuring the

relations between the source and target rules.

Towards producing bilingual lexica from monolingual corpora

Jingyi Han and Núria Bel

Bilingual lexica are the basis for many cross-lingual natural

language processing tasks. Recent works have shown success in

learning bilingual dictionaries by taking advantage of comparable

corpora and a diverse set of signals derived from monolingual

corpora. In the present work, we describe an approach to

automatically learn bilingual lexica by training a supervised

classifier using word embedding-based vectors of only a few

hundred translation equivalent word pairs. The word embedding

representations of translation pairs were obtained from source and

target monolingual corpora, which are not necessarily related.

Our classifier is able to predict whether a new word pair is

under a translation relation or not. We tested it on two quite

distinct language pairs, Chinese-Spanish and English-Spanish.

The classifiers achieved more than 0.90 precision and recall for

both language pairs in different evaluation scenarios. These results

show a high potential for this method to be used in bilingual lexica


production for language pairs with reduced amount of parallel or

comparable corpora, in particular for phrase table expansion in

Statistical Machine Translation systems.
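
A minimal sketch of the setup described above: a (source word, target word) pair is represented from two independently trained monolingual embeddings (concatenation is an assumption here), and a classifier trained on a few seed translation pairs predicts whether a new pair stands in a translation relation. Vectors and seed pairs below are toy data.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    src_emb = {w: rng.normal(size=50) for w in ["perro", "gato", "mesa"]}
    tgt_emb = {w: rng.normal(size=50) for w in ["dog", "cat", "table"]}

    def pair_vector(src, tgt):
        # Concatenate the two monolingual embeddings of the candidate pair.
        return np.concatenate([src_emb[src], tgt_emb[tgt]])

    seed = [("perro", "dog", 1), ("gato", "cat", 1),
            ("perro", "table", 0), ("mesa", "cat", 0)]
    X = np.stack([pair_vector(s, t) for s, t, _ in seed])
    y = [lab for _, _, lab in seed]
    clf = SVC(kernel="linear").fit(X, y)

    print(clf.predict(pair_vector("mesa", "table").reshape(1, -1)))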

First Steps Towards Coverage-Based Sentence Alignment

Luís Gomes and Gabriel Pereira Lopes

In this paper, we introduce a coverage-based scoring function that

discriminates between parallel and non-parallel sentences. When

plugged into Bleualign, a state-of-the-art sentence aligner, our

function improves both precision and recall of alignments over the

originally proposed BLEU score. Furthermore, since our scoring

function uses Moses phrase tables directly we avoid the need to

translate the texts to be aligned, which is time-consuming and a

potential source of alignment errors.
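
The paper's exact scoring function is not reproduced here; the sketch below only illustrates a coverage-style score in which phrase-table entries instantiated on both sides contribute to the covered token fraction of each sentence (table format, matching and the harmonic combination are assumptions):

    def coverage_score(src_tokens, tgt_tokens, phrase_table):
        # phrase_table: set of (source_phrase, target_phrase) string pairs
        def find(tokens, phrase):
            phr = phrase.split()
            n = len(phr)
            return [range(i, i + n) for i in range(len(tokens) - n + 1)
                    if tokens[i:i + n] == phr]
        src_cov, tgt_cov = set(), set()
        for s_phr, t_phr in phrase_table:
            s_hits, t_hits = find(src_tokens, s_phr), find(tgt_tokens, t_phr)
            if s_hits and t_hits:   # the pair is instantiated on both sides
                for r in s_hits:
                    src_cov.update(r)
                for r in t_hits:
                    tgt_cov.update(r)
        c_s = len(src_cov) / len(src_tokens) if src_tokens else 0.0
        c_t = len(tgt_cov) / len(tgt_tokens) if tgt_tokens else 0.0
        return 0.0 if c_s + c_t == 0 else 2 * c_s * c_t / (c_s + c_t)

    table = {("bom dia", "good morning"), ("obrigado", "thank you")}
    print(coverage_score("bom dia".split(), "good morning everyone".split(), table))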

Using the TED Talks to Evaluate Spoken Post-editing of Machine Translation

Jeevanthi Liyanapathirana and Andrei Popescu-Belis

This paper presents a solution to evaluate spoken post-editing

of imperfect machine translation output by a human translator.

We compare two approaches to the combination of machine

translation (MT) and automatic speech recognition (ASR): a

heuristic algorithm and a machine learning method. To obtain a

data set with spoken post-editing information, we use the French

version of TED talks as the source texts submitted to MT, and

the spoken English counterparts as their corrections, which are

submitted to an ASR system. We experiment with various levels

of artificial ASR noise and also with a state-of-the-art ASR

system. The results show that the combination of MT with ASR

improves over both individual outputs of MT and ASR in terms of

BLEU scores, especially when ASR performance is low.

Phrase Level Segmentation and Labelling of Machine Translation Errors

Frédéric Blain, Varvara Logacheva and Lucia Specia

This paper presents our work towards a novel approach for Quality

Estimation (QE) of machine translation based on sequences of

adjacent words, the so-called phrases. This new level of QE

aims to provide a natural balance between QE at word and

sentence level, which are either too fine-grained or too coarse

for some applications. However, phrase-level QE implies

an intrinsic challenge: how to segment a machine translation into

sequences of words (contiguous or not) that represent an error.

We discuss three possible segmentation strategies to automatically

extract erroneous phrases. We evaluate these strategies against

annotations at phrase-level produced by humans, using a new

dataset collected for this purpose.

SubCo: A Learner Translation Corpus of Human and Machine Subtitles

José Manuel Martínez Martínez and Mihaela Vela

In this paper, we present a freely available corpus of human and

automatic translations of subtitles. The corpus comprises the

original English subtitles (SRC), both human (HT) and machine

translations (MT) into German, as well as post-editions (PE) of

the MT output. HT and MT are annotated with errors. Moreover,

human evaluation is included in HT, MT, and PE. Such a corpus

is a valuable resource for both human and machine translation

communities, enabling the direct comparison – in terms of errors

and evaluation – between human and machine translations and

post-edited machine translations.

P28 - Multiword Expressions, Thursday, May 26, 11:45

Chairperson: Irina Temnikova Poster Session

Towards Lexical Encoding of Multi-Word Expressions in Spanish Dialects

Diana Bogantes, Eric Rodríguez, Alejandro Arauco, Alejandro Rodríguez and Agata Savary

This paper describes a pilot study in lexical encoding of multi-

word expressions (MWEs) in 4 Latin American dialects of

Spanish: Costa Rican, Colombian, Mexican and Peruvian. We

describe the variability of MWE usage across dialects. We adapt

an existing data model to a dialect-aware encoding, so as to

represent dialect-related specificities, while avoiding redundancy

of the data common to all dialects. A dozen linguistic

properties of MWEs can be expressed in this model, both on

the level of a whole MWE and of its individual components.

We describe the resulting lexical resource containing several

dozens of MWEs in four dialects and we propose a method

for constructing a web corpus as a support for crowdsourcing

examples of MWE occurrences. The resource is available under

an open license and paves the way towards a large-scale dialect-

aware language resource construction, which should prove useful

in both traditional and novel NLP applications.

JATE 2.0: Java Automatic Term Extraction with Apache Solr

Ziqi Zhang, Jie Gao and Fabio Ciravegna

Automatic Term Extraction (ATE) or Recognition (ATR) is a

fundamental processing step preceding many complex knowledge


engineering tasks. However, few methods have been implemented

as public tools and in particular, available as open-source

freeware. Further, little effort is made to develop an adaptable

and scalable framework that enables customization, development,

and comparison of algorithms under a uniform environment.

This paper introduces JATE 2.0, a complete remake of the free

Java Automatic Term Extraction Toolkit (Zhang et al., 2008)

delivering new features including: (1) highly modular, adaptable

and scalable ATE thanks to integration with Apache Solr, the open

source free-text indexing and search platform; (2) an extended

collection of state-of-the-art algorithms. We carry out experiments

on two well-known benchmarking datasets and compare the

algorithms along the dimensions of effectiveness (precision) and

efficiency (speed and memory consumption). To the best of our

knowledge, this is by far the only free ATE library offering a

flexible architecture and the most comprehensive collection of

algorithms.

A lexicon of perception for the identification of synaesthetic metaphors in corpora

Francesca Strik Lievers and Chu-Ren Huang

Synaesthesia is a type of metaphor associating linguistic

expressions that refer to two different sensory modalities.

Previous studies, based on the analysis of poetic texts, have

shown that synaesthetic transfers tend to go from the lower toward

the higher senses (e.g., sweet music vs. musical sweetness).

In non-literary language synaesthesia is rare, and finding a

sufficient number of examples manually would be too time-

consuming. In order to verify whether the directionality also

holds for conventional synaesthesia found in non-literary texts,

an automatic procedure for the identification of instances of

synaesthesia is therefore highly desirable. In this paper, we first

focus on the preliminary step of this procedure, that is, the creation

of a controlled lexicon of perception. Next, we present the results

of a small pilot study that applies the extraction procedure to

English and Italian corpus data.
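
The identification and directionality check can be illustrated as follows: flag adjective-noun pairs whose perception modalities differ and record whether the transfer goes from a lower to a higher sense. The tiny lexicon and the sense ordering below are illustrative assumptions, not the paper's resource.

    SENSE = {"sweet": "taste", "soft": "touch", "bright": "sight",
             "music": "hearing", "voice": "hearing", "colour": "sight"}
    ORDER = ["touch", "taste", "smell", "hearing", "sight"]   # lower -> higher

    def synaesthetic(adj, noun):
        s_a, s_n = SENSE.get(adj), SENSE.get(noun)
        if s_a and s_n and s_a != s_n:
            direction = ("lower->higher"
                         if ORDER.index(s_a) < ORDER.index(s_n)
                         else "higher->lower")
            return (adj, noun, s_a, s_n, direction)
        return None   # not synaesthetic (same modality or unknown word)

    for pair in [("sweet", "music"), ("bright", "voice"), ("soft", "colour")]:
        print(synaesthetic(*pair))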

TermoPL - a Flexible Tool for Terminology Extraction

Malgorzata Marciniak, Agnieszka Mykowiecka and Piotr Rychlik

The purpose of this paper is to introduce the TermoPL tool created

to extract terminology from domain corpora in Polish. The

program extracts noun phrases, term candidates, with the help

of a simple grammar that can be adapted to the user’s needs. It

applies the C-value method to rank term candidates being either

the longest identified nominal phrases or their nested subphrases.

The method operates on simplified base forms in order to unify

morphological variants of terms and to recognize their contexts.

We support the recognition of nested terms by word connection

strength which allows us to eliminate truncated phrases from the

top part of the term list. The program has an option to convert

simplified forms of phrases into correct phrases in the nominal

case. TermoPL accepts as input morphologically annotated and

disambiguated domain texts and creates a list of terms, the top

part of which comprises domain terminology. It can also compare

two candidate term lists using three different coefficients showing

asymmetry of term occurrences in this data.
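
For reference, the C-value ranking mentioned above is usually given in the following standard formulation (the tool's exact variant may differ), where $a$ is a candidate term, $|a|$ its length in words, $f(a)$ its frequency, and $T_a$ the set of longer candidate terms that contain $a$:

    \[
    \mathrm{C\text{-}value}(a) =
    \begin{cases}
    \log_2 |a| \cdot f(a), & \text{if } a \text{ is not nested,}\\
    \log_2 |a| \cdot \Bigl( f(a) - \tfrac{1}{|T_a|} \sum_{b \in T_a} f(b) \Bigr), & \text{otherwise.}
    \end{cases}
    \]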

GhoSt-NN: A Representative Gold Standard of German Noun-Noun Compounds

Sabine Schulte im Walde, Anna Hätty, Stefan Bott and Nana Khvtisavrishvili

This paper presents a novel gold standard of German

noun-noun compounds (Ghost-NN) including 868 compounds

annotated with corpus frequencies of the compounds and their

constituents, productivity and ambiguity of the constituents,

semantic relations between the constituents, and compositionality

ratings of compound-constituent pairs. Moreover, a subset

of the compounds containing 180 compounds is balanced for

the productivity of the modifiers (distinguishing low/mid/high

productivity) and the ambiguity of the heads (distinguishing

between heads with 1, 2 and >2 senses).

DeQue: A Lexicon of Complex Prepositions and Conjunctions in French

Carlos Ramisch, Alexis Nasr, André Valli and José Deulofeu

We introduce DeQue, a lexicon covering French complex

prepositions (CPRE) like “à partir de” (from) and complex

conjunctions (CCONJ) like “bien que” (although). The lexicon

includes fine-grained linguistic description based on empirical

evidence. We describe the general characteristics of CPRE and

CCONJ in French, with special focus on syntactic ambiguity.

Then, we list the selection criteria used to build the lexicon and the

corpus-based methodology employed to collect entries. Finally,

we quantify the ambiguity of each construction by annotating

around 100 sentences randomly taken from the FRWaC. In

addition to its theoretical value, the resource has many potential

practical applications. We intend to employ DeQue for treebank


annotation and to train a dependency parser that can take complex

constructions into account.

PARSEME Survey on MWE Resources

Gyri Smørdal Losnegaard, Federico Sangati, Carla Parra Escartín, Agata Savary, Sascha Bargmann and Johanna Monti

This paper summarizes the preliminary results of an ongoing

survey on multiword resources carried out within the IC1207

Cost Action PARSEME (PARSing and Multi-word Expressions).

Despite the availability of language resource catalogs and the

inventory of multiword datasets on the SIGLEX-MWE website,

multiword resources are scattered and difficult to find. In many

cases, language resources such as corpora, treebanks, or lexical

databases include multiwords as part of their data or take them

into account in their annotations. However, these resources need

to be centralized to make them accessible. The aim of this

survey is to create a portal where researchers can easily find

multiword(-aware) language resources for their research. We

report on the design of the survey and analyze the data gathered

so far. We also discuss the problems we have detected upon

examination of the data as well as possible ways of enhancing

the survey.

Multiword Expressions in Child Language

Rodrigo Wilkens, Marco Idiart and Aline Villavicencio

The goal of this work is to introduce CHILDES-MWE, which

contains English CHILDES corpora automatically annotated with

Multiword Expressions (MWEs) information. The result is a

resource with almost 350,000 sentences annotated with more than

70,000 distinct MWEs of various types from both longitudinal

and latitudinal corpora. This resource can be used for large

scale language acquisition studies of how MWEs feature in child

language. Focusing on compound nouns (CN), we then verify

in a longitudinal study if there are differences in the distribution

and compositionality of CNs in child-directed and child-produced

sentences across ages. Moreover, using additional latitudinal data,

we investigate if there are further differences in CN usage and in

compositionality preferences. The results obtained for the child-

produced sentences reflect CN distribution and compositionality

in child-directed sentences.

Transfer-Based Learning-to-Rank Assessment of Medical Term Technicality

Dhouha Bouamor, Leonardo Campillos Llanos, Anne-Laure Ligozat, Sophie Rosset and Pierre Zweigenbaum

While measuring the readability of texts has been a long-standing

research topic, assessing the technicality of terms has only been

addressed more recently and mostly for the English language.

In this paper, we train a learning-to-rank model to determine a

specialization degree for each term found in a given list. Since

no training data for this task exist for French, we train our system

with non-lexical features on English data, namely, the Consumer

Health Vocabulary, then apply it to French. The features

include the likelihood ratio of the term based on specialized and

lay language models, and tests for containing morphologically

complex words. The evaluation of this approach is conducted on

134 terms from the UMLS Metathesaurus and 868 terms from the

Eugloss thesaurus. The Normalized Discounted Cumulative Gain

obtained by our system is over 0.8 on both test sets. Besides,

thanks to the learning-to-rank approach, adding morphological

features to the language model features improves the results on

the Eugloss thesaurus.

Example-based Acquisition of Fine-grained Collocation Resources

Sara Rodríguez-Fernández, Roberto Carlini, Luis Espinosa Anke and Leo Wanner

Collocations such as “heavy rain” or “make [a] decision”, are

combinations of two elements where one (the base) is freely

chosen, while the choice of the other (collocate) is restricted,

depending on the base. Collocations present difficulties even

to advanced language learners, who usually struggle to find the

right collocate to express a particular meaning, e.g., both “heavy”

and “strong” express the meaning ’intense’, but while “rain”

selects “heavy”, “wind” selects “strong”. Lexical Functions

(LFs) describe the meanings that hold between the elements of

collocations, such as ’intense’, ’perform’, ’create’, ’increase’,

etc. Language resources with semantically classified collocations

would be of great help for students; however, they are expensive

to build, since they are manually constructed, and scarce. We

present an unsupervised approach to the acquisition and semantic

classification of collocations according to LFs, based on word

embeddings in which, given an example of a collocation for each

of the target LFs and a set of bases, the system retrieves a list of

collocates for each base and LF.

MWEs in Treebanks: From Survey to Guidelines

Victoria Rosén, Koenraad De Smedt, Gyri Smørdal Losnegaard, Eduard Bejcek, Agata Savary and Petya Osenova

By means of an online survey, we have investigated ways in

which various types of multiword expressions are annotated in

existing treebanks. The results indicate that there is considerable

variation in treatments across treebanks and thereby also, to

some extent, across languages and across theoretical frameworks.

The comparison is focused on the annotation of light verb

constructions and verbal idioms. The survey shows that the light


verb constructions either get special annotations as such, or are

treated as ordinary verbs, while VP idioms are handled through

different strategies. Based on insights from our investigation,

we propose some general guidelines for annotating multiword

expressions in treebanks. The recommendations address the

following application-based needs: distinguishing MWEs from

similar but compositional constructions; searching distinct types

of MWEs in treebanks; awareness of literal and nonliteral

meanings; and normalization of the MWE representation. The

cross-lingually and cross-theoretically focused survey is intended

as an aid to accessing treebanks and an aid for further work on

treebank annotation.

Multiword Expressions Dataset for Indian Languages

Dhirendra Singh, Sudha Bhingardive and Pushpak Bhattacharya

Multiword Expressions (MWEs) are used frequently in natural

languages, but understanding the diversity in MWEs is one of the

open problems in the area of Natural Language Processing. In the

context of Indian languages, MWEs play an important role. In

this paper, we present MWEs annotation dataset created for Indian

languages viz., Hindi and Marathi. We extract possible MWE

candidates using two repositories: 1) the POS-tagged corpus and

2) the IndoWordNet synsets. Annotation is done for two types

of MWEs: compound nouns and light verb constructions. In the

process of annotation, human annotators tag valid MWEs from

these candidates based on the standard guidelines provided to

them. We obtained 3178 compound nouns and 2556 light verb

constructions in Hindi and 1003 compound nouns and 2416 light

verb constructions in Marathi using two repositories mentioned

before. This created resource is made available publicly and can

be used as a gold standard for Hindi and Marathi MWE systems.

P29 - Treebanks (2), Thursday, May 26, 11:45

Chairperson: Claire Bonial Poster Session

MarsaGram: an excursion in the forests of parsing trees

Philippe Blache, Stéphane Rauzy and Grégoire Montcheuil

The question of how to compare languages and more generally

the domain of linguistic typology, relies on the study of

different linguistic properties or phenomena. Classically, such

a comparison is done semi-manually, for example by extracting

information from databases such as the WALS. However, it

remains difficult to identify precisely regular parameters, available

for different languages, that can be used as a basis towards

modeling. We propose in this paper, focusing on the question

of syntactic typology, a method for automatically extracting

such parameters from treebanks, bringing them into a typology

perspective. We present the method and the tools for inferring

such information and navigating through the treebanks. The

approach has been applied to 10 languages of the Universal

Dependencies Treebank. Our approach is evaluated by showing

how automatic classification correlates with language families.

EasyTree: A Graphical Tool for Dependency Tree Annotation

Alexa Little and Stephen Tratz

This paper introduces EasyTree, a dynamic graphical tool for

dependency tree annotation. Built in JavaScript using the popular

D3 data visualization library, EasyTree allows annotators to

construct and label trees entirely by manipulating graphics, and

then export the corresponding data in JSON format. Human users

are thus able to annotate in an intuitive way without compromising

the machine-compatibility of the output. EasyTree has a number

of features to assist annotators, including color-coded part-of-

speech indicators and optional translation displays. It can also

be customized to suit a wide range of projects; part-of-speech

categories, edge labels, and many other settings can be edited

from within the GUI. The system also utilizes UTF-8 encoding

and properly handles both left-to-right and right-to-left scripts.

By providing a user-friendly annotation tool, we aim to reduce

time spent transforming data or learning to use the software,

to improve the user experience for annotators, and to make

annotation approachable even for inexperienced users. Unlike

existing solutions, EasyTree is built entirely with standard web

technologies–JavaScript, HTML, and CSS–making it ideal for

web-based annotation efforts, including crowdsourcing efforts.

Hypergraph Modelization of a Syntactically Annotated English Wikipedia Dump

Edmundo Pavel Soriano Morales, Julien Ah-Pine and Sabine Loudcher

Wikipedia, the well known internet encyclopedia, is nowadays

a widely used source of information. To leverage its rich

information, already parsed versions of Wikipedia have been

proposed. We present an annotated dump of the English

Wikipedia. This dump draws upon previously released Wikipedia

parsed dumps. Still, we head in a different direction. In

this parse we focus more on the syntactic characteristics of

words: aside from the classical Part-of-Speech (PoS) tags and

dependency parsing relations, we provide the full constituent


parse branch for each word in a succinct way. Additionally,

we propose a hypergraph network representation of the extracted

linguistic information. The proposed modelization aims to take

advantage of the information stored within our parsed Wikipedia

dump. We hope that by releasing these resources, researchers

from the concerned communities will have a ready-to-experiment

Wikipedia corpus to compare and distribute their work. We render

public our parsed Wikipedia dump as well as the tool (and its

source code) used to perform the parse. The hypergraph network

and its related metadata is also distributed.

Detecting Annotation Scheme Variation in Out-of-Domain Treebanks

Yannick Versley and Julius Steen

To ensure portability of NLP systems across multiple domains,

existing treebanks are often extended by adding trees from

interesting domains that were not part of the initial annotation

effort. In this paper, we will argue that it is both useful

from an application viewpoint and enlightening from a linguistic

viewpoint to detect and reduce divergence in annotation schemes

between extant and new parts in a set of treebanks that is to be

used in evaluation experiments. The results of our correction and

harmonization efforts will be made available to the public as a test

suite for the evaluation of constituent parsing.

Universal Dependencies for Persian

Mojgan Seraji, Filip Ginter and Joakim Nivre

The Persian Universal Dependency Treebank (Persian UD) is a

recent effort of treebanking Persian with Universal Dependencies

(UD), an ongoing project that designs unified and cross-

linguistically valid grammatical representations including part-of-

speech tags, morphological features, and dependency relations.

The Persian UD is a conversion of the Uppsala Persian Dependency Treebank (UPDT) to the Universal Dependencies

framework and consists of nearly 6,000 sentences and 152,871

word tokens with an average sentence length of 25 words.

In addition to the universal dependencies syntactic annotation

guidelines, the two treebanks differ in tokenization. All words

containing unsegmented clitics (pronominal and copula clitics)

annotated with complex labels in the UPDT have been separated

from the clitics and appear with distinct labels in the Persian UD.

The treebank has its original syntactic annotation scheme based

on Stanford Typed Dependencies. In this paper, we present the

approaches taken in the development of the Persian UD.

Hard Time Parsing Questions: Building a QuestionBank for French

Djamé Seddah and Marie Candito

We present the French Question Bank, a treebank of 2600

questions. We show that the performance of classical parsing models drops when facing out-of-domain data with strong structural divergences, while the inclusion of this data set is highly beneficial without harming the parsing of non-question data. Two thirds of the questions being aligned with the QB (Judge et al., 2006) and freely available, this treebank will prove useful to build robust NLP systems.

Enhanced English Universal Dependencies: An Improved Representation for Natural Language Understanding Tasks

Sebastian Schuster and Christopher D. Manning

Many shallow natural language understanding tasks use

dependency trees to extract relations between content words.

However, strict surface-structure dependency trees tend to follow

the linguistic structure of sentences too closely and frequently fail

to provide direct relations between content words. To mitigate

this problem, the original Stanford Dependencies representation

also defines two dependency graph representations which contain

additional and augmented relations that explicitly capture

otherwise implicit relations between content words. In this paper,

we revisit and extend these dependency graph representations in

light of the recent Universal Dependencies (UD) initiative and

provide a detailed account of an enhanced and an enhanced++

English UD representation. We further present a converter from

constituency to basic, i.e., strict surface structure, UD trees, and

a converter from basic UD trees to enhanced and enhanced++

English UD graphs. We release both converters as part of Stanford

CoreNLP and the Stanford Parser.
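
One of the augmentations that distinguish the enhanced graphs from basic trees is the propagation of relations across conjuncts. The Python sketch below reproduces only that single step on toy data as an illustration of the idea; it is a simplification and not the released CoreNLP converter.

    # Basic UD edges as (head_index, relation, dependent_index); 0 is the root.
    # Toy sentence: "Paul eats apples and pears"
    basic = [
        (2, "nsubj", 1),   # Paul <- eats
        (0, "root", 2),    # eats
        (2, "obj", 3),     # apples <- eats
        (3, "cc", 4),      # and <- apples
        (3, "conj", 5),    # pears <- apples
    ]

    def enhance(edges):
        """If X -rel-> Y and Y -conj-> Z, also add X -rel-> Z.
        A small piece of what enhanced UD graphs contain."""
        enhanced = list(edges)
        conj_pairs = [(h, d) for h, rel, d in edges if rel == "conj"]
        for first, second in conj_pairs:
            for h, rel, d in edges:
                if d == first and rel not in ("conj", "cc"):
                    enhanced.append((h, rel, second))
        return enhanced

    for edge in enhance(basic):
        print(edge)   # now also contains (2, 'obj', 5): "pears" is an object of "eats"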

A Proposition Bank of Urdu

Maaz Anwar, Riyaz Ahmad Bhat, Dipti Sharma, Ashwini Vaidya, Martha Palmer and Tafseer Ahmed Khan

This paper describes our efforts for the development of a

Proposition Bank for Urdu, an Indo-Aryan language. Our primary

goal is the labeling of syntactic nodes in the existing Urdu

dependency Treebank with specific argument labels. In essence,

it involves annotation of predicate argument structures of both

simple and complex predicates in the Treebank corpus. We

describe the overall process of building the PropBank of Urdu.

We discuss various statistics pertaining to the Urdu PropBank and

the issues which the annotators encountered while developing the

PropBank. We also discuss how these challenges were addressed

to successfully expand the PropBank corpus. While reporting

the Inter-annotator agreement between the two annotators, we

show that the annotators share similar understanding of the

annotation guidelines and of the linguistic phenomena present in

the language. The present size of this Propbank is around 180,000

tokens, which are double-propbanked by the two annotators for


simple predicates. Another 100,000 tokens have been annotated

for complex predicates of Urdu.

Czech Legal Text Treebank 1.0

Vincent Kríž, Barbora Hladka and Zdenka Uresova

We introduce a new member of the family of Prague

dependency treebanks. The Czech Legal Text Treebank 1.0 is

a morphologically and syntactically annotated corpus of 1,128

sentences. The treebank contains texts from the legal domain,

namely the documents from the Collection of Laws of the Czech

Republic. Legal texts differ from texts in other domains in several language phenomena, influenced by the rather high frequency of very long sentences. A manual annotation of such sentences presents

a new challenge. We describe a strategy and tools for this task.

The resulting treebank can be explored in various ways. It can be

downloaded from the LINDAT/CLARIN repository and viewed

locally using the TrEd editor or it can be accessed on-line using

the KonText and TreeQuery tools.

P30 - Linked Data
Thursday, May 26, 14:55

Chairperson: Felix Sasaki
Poster Session

Concepticon: A Resource for the Linking of Concept Lists

Johann-Mattis List, Michael Cysouw and Robert Forkel

We present an attempt to link the large amount of different concept

lists which are used in the linguistic literature, ranging from

Swadesh lists in historical linguistics to naming tests in clinical

studies and psycholinguistics. This resource, our Concepticon,

links 30 222 concept labels from 160 concept lists to 2495 concept

sets. Each concept set is given a unique identifier, a unique

label, and a human-readable definition. Concept sets are further

structured by defining different relations between the concepts.

The resource can be used for various purposes. Serving as

a rich reference for new and existing databases in diachronic

and synchronic linguistics, it allows researchers quick access

to studies on semantic change, cross-linguistic polysemies, and

semantic associations.

LVF-lemon – Towards a Linked Data Representation of “Les Verbes français”

Ingrid Falk and Achim Stein

In this study we elaborate a road map for the conversion of a

traditional lexical syntactico-semantic resource for French into

a linguistic linked open data (LLOD) model. Our approach

uses current best-practices and the analyses of earlier similar

undertakings (lemonUBY and PDEV-lemon) to tease out the most

appropriate representation for our resource.

Riddle Generation using Word Associations

Paloma Galvan, Virginia Francisco, Raquel Hervas and Gonzalo Mendez

In knowledge bases where concepts have associated properties,

there is a large amount of comparative information that is

implicitly encoded in the values of the properties these concepts

share. Although there have been previous approaches to

generating riddles, none of them seem to take advantage

of structured information stored in knowledge bases such as

Thesaurus Rex, which organizes concepts according to the fine

grained ad-hoc categories they are placed into by speakers in

everyday language, along with associated properties or modifiers.

Taking advantage of these shared properties, we have developed

a riddle generator that creates riddles about concepts represented

as common nouns. The base of these riddles are comparisons

between the target concept and other entities that share some of

its properties. In this paper, we describe the process we have

followed to generate the riddles starting from the target concept

and we show the results of the first evaluation we have carried out

to test the quality of the resulting riddles.
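
The comparison-based strategy can be sketched in a few lines of Python: pick other concepts that share a property with the target and phrase each shared property as a clue. The knowledge base and templates below are invented toy data, not Thesaurus Rex content or the authors' generator.

    import random

    # Toy knowledge base: concept -> set of (property, ad-hoc category) pairs.
    properties = {
        "lemon": {("yellow", "fruit"), ("sour", "fruit"), ("round", "object")},
        "sun":   {("yellow", "celestial body"), ("round", "object"), ("hot", "celestial body")},
        "ball":  {("round", "object")},
    }

    def riddle(target, kb):
        """Build clues by comparing the target with concepts sharing a property."""
        clues = []
        target_props = {p for p, _ in kb[target]}
        for other, props in kb.items():
            if other == target:
                continue
            for prop in sorted(target_props & {p for p, _ in props}):
                clues.append(f"I am {prop} like a {other}")
        random.shuffle(clues)
        return ", ".join(clues[:2]) + ". What am I?"

    print(riddle("lemon", properties))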

Challenges of Adjective Mapping between plWordNet and Princeton WordNet

Ewa Rudnicka, Wojciech Witkowski and Katarzyna Podlaska

The paper presents the strategy and results of mapping adjective

synsets between plWordNet (the wordnet of Polish, cf. Piasecki

et al. 2009, Maziarz et al. 2013) and Princeton WordNet

(cf. Fellbaum 1998). The main challenge of this enterprise

has been the very different synset relation structures in the two

networks: horizontal, dumbbell-model based in PWN and vertical,

hyponymy-based in plWN. Moreover, the two wordnets display

differences in the grouping of adjectives into semantic domains

and in the size of the adjective category. To handle the above

contrasts, a series of automatic prompt algorithms and a manual

mapping procedure relying on corresponding synset and lexical

unit relations as well as on inter-lingual relations between noun

synsets were proposed in the pilot stage of mapping (Rudnicka et

al. 2015). In the paper we discuss the final results of the mapping


process as well as explain example mapping choices. Suggestions

for further development of mapping are also given.

Relation- and Phrase-level Linking of FrameNet with Sar-graphs

Aleksandra Gabryszak, Sebastian Krause, Leonhard Hennig, Feiyu Xu and Hans Uszkoreit

Recent research shows the importance of linking linguistic

knowledge resources for the creation of large-scale linguistic

data. We describe our approach for combining two English

resources, FrameNet and sar-graphs, and illustrate the benefits of

the linked data in a relation extraction setting. While FrameNet

consists of schematic representations of situations, linked to

lexemes and their valency patterns, sar-graphs are knowledge

resources that connect semantic relations from factual knowledge

graphs to the linguistic phrases used to express instances of these

relations. We analyze the conceptual similarities and differences

of both resources and propose to link sar-graphs and FrameNet

on the levels of relations/frames as well as phrases. The former

alignment involves a manual ontology mapping step, which allows

us to extend sar-graphs with new phrase patterns from FrameNet.

The phrase-level linking, on the other hand, is fully automatic. We

investigate the quality of the automatically constructed links and

identify two main classes of errors.

Mapping Ontologies Using Ontologies: Cross-lingual Semantic Role Information Transfer

Balázs Indig, Márton Miháltz and András Simonyi

This paper presents the process of enriching the verb frame

database of a Hungarian natural language parser to enable the

assignment of semantic roles. We accomplished this by linking

the parser’s verb frame database to existing linguistic resources

such as VerbNet and WordNet, and automatically transferring

back semantic knowledge. We developed OWL ontologies that

map the various constraint description formalisms of the linked

resources and employed a logical reasoning device to facilitate the

linking procedure. We present results and discuss the challenges

and pitfalls that arose from this undertaking.

Generating a Large-Scale Entity Linking Dictionary from Wikipedia Link Structure and Article Text

Ravindra Harige and Paul Buitelaar

Wikipedia has been increasingly used as a knowledge base for

open-domain Named Entity Linking and Disambiguation. In this

task, a dictionary with entity surface forms plays an important

role in finding a set of candidate entities for the mentions in text.

Existing dictionaries mostly rely on the Wikipedia link structure,

like anchor texts, redirect links and disambiguation links. In this

paper, we introduce a dictionary for Entity Linking that includes

name variations extracted from Wikipedia article text, in addition

to name variations derived from the Wikipedia link structure. With

this approach, we show an increase in the coverage of entities and

their mentions in the dictionary in comparison to other Wikipedia

based dictionaries.
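
At its core, such a dictionary maps lowercased surface forms to counts of the entities they have been observed to refer to, whether the observation comes from an anchor text, a redirect or an in-article name variation. The sketch below shows that shape on invented toy pairs; the actual resource is of course built from a full Wikipedia dump.

    from collections import defaultdict

    # (surface form, linked entity) pairs as they might be harvested from
    # anchors, redirects and article text (toy data).
    observations = [
        ("Paris", "Paris"), ("Paris", "Paris"), ("Paris", "Paris_Hilton"),
        ("the French capital", "Paris"),
    ]

    # surface form -> {entity: count}
    dictionary = defaultdict(lambda: defaultdict(int))
    for surface, entity in observations:
        dictionary[surface.lower()][entity] += 1

    def candidates(mention):
        """Return candidate entities for a mention, most frequent first."""
        counts = dictionary.get(mention.lower(), {})
        total = sum(counts.values()) or 1
        return [(e, c / total) for e, c in sorted(counts.items(), key=lambda x: -x[1])]

    print(candidates("Paris"))  # [('Paris', 0.67), ('Paris_Hilton', 0.33)] approximately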

The Open Linguistics Working Group: Developing the Linguistic Linked Open Data Cloud

John Philip McCrae, Christian Chiarcos, Francis Bond, Philipp Cimiano, Thierry Declerck, Gerard de Melo, Jorge Gracia, Sebastian Hellmann, Bettina Klimek, Steven Moran, Petya Osenova, Antonio Pareja-Lora and Jonathan Pool

The Open Linguistics Working Group (OWLG) brings together

researchers from various fields of linguistics, natural language

processing, and information technology to present and discuss

principles, case studies, and best practices for representing,

publishing and linking linguistic data collections. A major

outcome of our work is the Linguistic Linked Open Data (LLOD)

cloud, an LOD (sub-)cloud of linguistic resources, which covers

various linguistic databases, lexicons, corpora, terminologies, and

metadata repositories. We present and summarize five years of

progress on the development of the cloud and of advancements

in open data in linguistics, and we describe recent community

activities. The paper aims to serve as a guideline to orient and

involve researchers with the community and/or Linguistic Linked

Open Data.

Cross-lingual RDF Thesauri Interlinking

Tatiana Lesnikova, Jérôme David and Jérôme Euzenat

Various lexical resources are being published in RDF. To enhance

the usability of these resources, identical resources in different

data sets should be linked. If lexical resources are described

in different natural languages, then techniques to deal with

multilinguality are required for interlinking. In this paper,

we evaluate machine translation for interlinking concepts, i.e.,

generic entities named with a common noun or term. In our

previous work, the evaluated method has been applied on named

entities. We conduct two experiments involving different thesauri

in different languages. The first experiment involves concepts

from the TheSoz multilingual thesaurus in three languages:

English, French and German. The second experiment involves

concepts from the EuroVoc and AGROVOC thesauri in English

and Chinese respectively. Our results demonstrate that machine


translation can be beneficial for cross-lingual thesauri interlinking

independently of a dataset structure.

P31 - LR Infrastructures and Architectures (1)
Thursday, May 26, 14:55

Chairperson: Yohei Murakami
Poster Session

The Language Resource Life Cycle: Towards a Generic Model for Creating, Maintaining, Using and Distributing Language Resources

Georg Rehm

Language Resources (LRs) are an essential ingredient of current

approaches in Linguistics, Computational Linguistics, Language

Technology and related fields. LRs are collections of spoken or

written language data, typically annotated with linguistic analysis

information. Different types of LRs exist, for example, corpora,

ontologies, lexicons, collections of spoken language data (audio),

or collections that also include video (multimedia, multimodal).

Often, LRs are distributed with specific tools, documentation,

manuals or research publications. The different phases that

involve creating and distributing an LR can be conceptualised as

a life cycle. While the idea of handling the LR production and maintenance process in terms of a life cycle was brought up quite some time ago, a best practice model or common approach can still be considered a research gap. This article aims to help

fill this gap by proposing an initial version of a generic Language

Resource Life Cycle that can be used to inform, direct, control

and evaluate LR research and development activities (including

description, management, production, validation and evaluation

workflows).

A Large-scale Recipe and Meal Data Collection as Infrastructure for Food Research

Jun Harashima, Michiaki Ariga, Kenta Murata and Masayuki Ioki

Everyday meals are an important part of our daily lives and,

currently, there are many Internet sites that help us plan these

meals. Allied to the growth in the amount of food data such

as recipes available on the Internet is an increase in the number

of studies on these data, such as recipe analysis and recipe

search. However, there are few publicly available resources for

food research; those that do exist do not include a wide range

of food data or any meal data (that is, likely combinations of

recipes). In this study, we construct a large-scale recipe and

meal data collection as the underlying infrastructure to promote

food research. Our corpus consists of approximately 1.7 million

recipes and 36000 meals in cookpad, one of the largest recipe

sites in the world. We made the corpus available to researchers

in February 2015 and as of February 2016, 82 research groups at

56 universities have made use of it to enhance their studies.

EstNLTK - NLP Toolkit for Estonian

Siim Orasmaa, Timo Petmanson, Alexander Tkachenko, Sven Laur and Heiki-Jaan Kaalep

Although there are many tools for natural language processing

tasks in Estonian, these tools are very loosely interoperable,

and it is not easy to build practical applications on top of

them. In this paper, we introduce a new Python library for

natural language processing in Estonian, which provides a unified programming interface for various NLP components. The EstNLTK toolkit provides utilities for basic NLP tasks including tokenization, morphological analysis, lemmatisation and named entity recognition, and offers more advanced features such as clause segmentation, temporal expression extraction and

normalization, verb chain detection, Estonian Wordnet integration

and rule-based information extraction. Accompanied by a detailed

API documentation and comprehensive tutorials, EstNLTK is

suitable for a wide audience. We believe EstNLTK is

mature enough to be used for developing NLP-backed systems

both in industry and research. EstNLTK is freely available under

the GNU GPL version 2+ license, which is standard for academic

software.
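
A hedged usage sketch of the unified interface described above, assuming the EstNLTK 1.x Text class and its lemmas, postags and named_entities accessors; the exact attribute names may differ between toolkit versions, so this should be read as an approximation of the API rather than a verified example.

    # Assumes EstNLTK 1.x is installed (pip install estnltk); the attribute
    # names below follow the 1.x-style Text interface and may differ in
    # later releases of the toolkit.
    from estnltk import Text

    text = Text("Eesti Vabariik asub Põhja-Euroopas.")

    print(text.lemmas)           # morphological analysis and lemmatisation
    print(text.postags)          # part-of-speech tags
    print(text.named_entities)   # named entity recognition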

South African National Centre for Digital Language Resources

Justus Roux

This presentation introduces the imminent establishment of a new

language resource infrastructure focusing on languages spoken in

Southern Africa, with an eventual aim to become a hub for digital

language resources within Sub-Saharan Africa. The Constitution

of South Africa makes provision for 11 official languages all with

equal status. The current language Resource Management Agency

will be merged with the new Centre, which will have a wider

focus than that of data acquisition, management and distribution.

The Centre will entertain two main programs: Digitisation and

Digital Humanities. The digitisation program will focus on the

systematic digitisation of relevant text, speech and multi-modal

data across the official languages. Relevancy will be determined

by a Scientific Advisory Board. This will take place on a

continuous basis through specified projects allocated to national

members of the Centre, as well as through open-calls aimed at

the academic as well as local communities. The digital resources

will be managed and distributed through a dedicated web-based

portal. The development of the Digital Humanities program

will entail extensive academic support for projects implementing

digital language based data. The Centre will function as an


enabling research infrastructure primarily supported by national

government and hosted by the North-West University.

Design and Development of the MERLIN Learner Corpus Platform

Verena Lyding and Karin Schöne

In this paper, we report on the design and development of an

online search platform for the MERLIN corpus of learner texts

in Czech, German and Italian. It was created in the context of the

MERLIN project, which aims at empirically illustrating features

of the Common European Framework of Reference (CEFR) for

evaluating language competences based on authentic learner text

productions compiled into a learner corpus. Furthermore, the

project aims at providing access to the corpus through a search

interface adapted to the needs of multifaceted target groups

involved with language learning and teaching. This article starts

by providing a brief overview on the project ambition, the data

resource and its intended target groups. Subsequently, the main

focus of the article is on the design and development process of

the platform, which is carried out in a user-centred fashion. The

paper presents the user studies carried out to collect requirements,

details the resulting decisions concerning the platform design and

its implementation, and reports on the evaluation of the platform

prototype and final adjustments.

FLAT: Constructing a CLARIN Compatible Home for Language Resources

Menzo Windhouwer, Marc Kemps-Snijders, Paul Trilsbeek, André Moreira, Bas Van der Veen, Guilherme Silva and Daniel Von Reihn

Language resources are valuable assets, both for institutions

and researchers. To safeguard these resources requirements for

repository systems and data management have been specified by

various branch organizations, e.g., CLARIN and the Data Seal

of Approval. This paper describes these and some additional

ones posed by the authors’ home institutions. It also shows how they are met by FLAT, to provide a new home for language resources. The basis of FLAT is formed by the Fedora Commons repository system. This repository system can meet many of the requirements out of the box, but additional configuration and some development work are still needed to meet the remaining

ones, e.g., to add support for Handles and Component Metadata.

This paper describes design decisions taken in the construction of

FLAT’s system architecture via a mix-and-match strategy, with a

preference for the reuse of existing solutions. FLAT is developed

and used by the Meertens Institute and The Language Archive, but

is also freely available for anyone in need of a CLARIN-compliant

repository for their language resources.

CLARIAH in the Netherlands

Jan Odijk

I introduce CLARIAH in the Netherlands, which aims to

contribute the Netherlands part of a Europe-wide humanities

research infrastructure. I describe the digital turn in the

humanities, the background and context of CLARIAH, both

nationally and internationally, its relation to the CLARIN and

DARIAH infrastructures, and the rationale for joining forces

between CLARIN and DARIAH in the Netherlands. I also

describe the first results of joining forces as achieved in the

CLARIAH-SEED project, and the plans of the CLARIAH-CORE

project, which is currently running.

Crosswalking from CMDI to Dublin Core and MARC 21

Claus Zinn, Thorsten Trippel, Steve Kaminski and Emanuel Dima

The Component MetaData Infrastructure (CMDI) is a framework

for the creation and usage of metadata formats to describe all

kinds of resources in the CLARIN world. To better connect to

the library world, and to allow librarians to enter metadata for

linguistic resources into their catalogues, a crosswalk from CMDI-

based formats to bibliographic standards is required. The general

and rather fluid nature of CMDI, however, makes it hard to map

arbitrary CMDI schemas to metadata standards such as Dublin

Core (DC) or MARC 21, which have a mature, well-defined and

fixed set of field descriptors. In this paper, we address the issue

and propose crosswalks between CMDI-based profiles originating

from the NaLiDa project and DC and MARC 21, respectively.
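
Stripped of the schema machinery, a crosswalk of this kind is a mapping from profile-specific element paths to a fixed target vocabulary, applied record by record. The sketch below shows that shape with a toy CMDI-like record; the CMDI element names are invented placeholders, and only the dc: terms are actual Dublin Core elements.

    # Hypothetical CMDI-style element names -> Dublin Core terms.
    CROSSWALK = {
        "ResourceName": "dc:title",
        "Creator/Person": "dc:creator",
        "DocumentationLanguage": "dc:language",
        "PublicationDate": "dc:date",
    }

    cmdi_record = {
        "ResourceName": "Example Treebank",
        "Creator/Person": "Jane Doe",
        "DocumentationLanguage": "deu",
        "PublicationDate": "2016",
        "ProjectName": "NaLiDa",   # no counterpart in this toy mapping
    }

    dc_record = {CROSSWALK[k]: v for k, v in cmdi_record.items() if k in CROSSWALK}
    unmapped = [k for k in cmdi_record if k not in CROSSWALK]
    print(dc_record)
    print("unmapped:", unmapped)  # fields that would need manual attention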

Data Management Plans and Data Centers

Denise DiPersio, Christopher Cieri and Daniel Jaquette

Data management plans, data sharing plans and the like are now

required by funders worldwide as part of research proposals.

Concerned with promoting the notion of open scientific data,

funders view such plans as the framework for satisfying the

generally accepted requirements for data generated in funded

research projects, among them that it be accessible, usable,

standardized to the degree possible, secure and stable. This

paper examines the origins of data management plans, their


requirements and issues they raise for data centers and HLT

resource development in general.

UIMA-Based JCoRe 2.0 Goes GitHub and Maven Central — State-of-the-Art Software Resource Engineering and Distribution of NLP Pipelines

Udo Hahn, Franz Matthies, Erik Faessler and Johannes Hellrich

We introduce JCoRe 2.0, the relaunch of a UIMA-based open

software repository for full-scale natural language processing

originating from the Jena University Language & Information

Engineering (JULIE) Lab. In an attempt to put the new release

of JCoRe on firm software engineering ground, we uploaded it to

GitHub, a social coding platform, with an underlying source code

versioning system and various means to support collaboration

for software development and code modification management.

In order to automate the builds of complex NLP pipelines and

properly represent and track dependencies of the underlying Java

code, we incorporated Maven as part of our software configuration

management efforts. In the meantime, we have deployed our

artifacts on Maven Central, as well. JCoRe 2.0 offers a broad

range of text analytics functionality (mostly) for English-language

scientific abstracts and full-text articles, especially from the life

sciences domain.

Facilitating Metadata Interoperability in CLARIN-DK

Lene Offersgaard and Dorte Haltrup Hansen

The issue for CLARIN archives at the metadata level is to make it possible for users to describe their data, even with their own

standard, and at the same time make these metadata meaningful

for a variety of users with a variety of resource types, and ensure

that the metadata are useful for search across all resources both

at the national and at the European level. We see that different

people from different research communities fill in the metadata

in different ways even though the metadata were defined and documented. This has an impact when the metadata are harvested

and displayed in different environments. A loss of information is

at stake. In this paper we view the challenges of ensuring metadata

interoperability through examples of propagation of metadata

values from the CLARIN-DK archive to the VLO. We see that

the CLARIN Community in many ways supports interoperability, but argue that agreeing upon standards and making clear definitions of the semantics of the metadata and their content is indispensable for the

interoperability to work successfully. The key points are clear and

freely available definitions, accessible documentation and easily

usable facilities and guidelines for the metadata creators.

The Language Application Grid and Galaxy

Nancy Ide, Keith Suderman, James Pustejovsky, MarcVerhagen and Christopher Cieri

The NSF-SI2-funded LAPPS Grid project is a collaborative

effort among Brandeis University, Vassar College, Carnegie-

Mellon University (CMU), and the Linguistic Data Consortium

(LDC), which has developed an open, web-based infrastructure

through which resources can be easily accessed and within

which tailored language services can be efficiently composed,

evaluated, disseminated and consumed by researchers, developers,

and students across a wide variety of disciplines. The LAPPS

Grid project recently adopted Galaxy (Giardine et al., 2005),

a robust, well-developed, and well-supported front end for

workflow configuration, management, and persistence. Galaxy

allows data inputs and processing steps to be selected from

graphical menus, and results are displayed in intuitive plots

and summaries that encourage interactive workflows and the

exploration of hypotheses. The Galaxy workflow engine provides

significant advantages for deploying pipelines of LAPPS Grid web

services, including not only the means to create and deploy locally run and even customized versions of the LAPPS Grid or to run the LAPPS Grid in the cloud, but also access to a huge

array of statistical and visualization tools that have been developed

for use in genomics research.

P32 - Large Projects and Infrastructures (1)
Thursday, May 26, 14:55

Chairperson: Zygmunt Vetulani
Poster Session

The IPR-cleared Corpus of Contemporary Written and Spoken Romanian Language

Dan Tufis, Verginica Barbu Mititelu, Elena Irimia, Stefan Daniel Dumitrescu and Tiberiu Boros

The article describes the current status of a large national

project, CoRoLa, aiming at building a reference corpus for

the contemporary Romanian language. Unlike many other

national corpora, CoRoLa contains only IPR-cleared texts

and speech data, obtained from some of the country’s most

representative publishing houses, broadcasting agencies, editorial

offices, newspapers and popular bloggers. For the written

component 500 million tokens are targeted and for the oral one

300 hours of recordings. The choice of texts is done according

to their functional style, domain and subdomain, also with an

eye to the international practice. A metadata file (following


the CMDI model) is associated with each text file. Collected

texts are cleaned and transformed in a format compatible with

the tools for automatic processing (segmentation, tokenization,

lemmatization, part-of-speech tagging). The paper also presents

up-to-date statistics about the structure of the corpus almost two

years before its official launching. The corpus will be freely

available for searching. Users will be able to download the

results of their searches and the original files when this is not against the stipulations in the protocols we have with text providers.

SYN2015: Representative Corpus of Contemporary Written Czech

Michal Kren, Václav Cvrcek, Tomáš Capka, Anna Cermáková, Milena Hnátková, Lucie Chlumská, Tomáš Jelínek, Dominika Kováríková, Vladimír Petkevic, Pavel Procházka, Hana Skoumalová, Michal Škrabal, Petr Trunecek, Pavel Vondricka and Adrian Jan Zasina

The paper concentrates on the design, composition and annotation

of SYN2015, a new 100-million representative corpus of

contemporary written Czech. SYN2015 is a sequel of the

representative corpora of the SYN series that can be described

as traditional (as opposed to the web-crawled corpora), featuring

cleared copyright issues, well-defined composition, reliability

of annotation and high-quality text processing. At the same

time, SYN2015 is designed as a reflection of the variety of

written Czech text production with necessary methodological and

technological enhancements that include a detailed bibliographic

annotation and text classification based on an updated scheme.

The corpus has been produced using a completely rebuilt text

processing toolchain called SynKorp. SYN2015 is lemmatized,

morphologically and syntactically annotated with state-of-the-art

tools. It has been published within the framework of the Czech

National Corpus and it is available via the standard corpus query

interface KonText at http://kontext.korpus.cz as well

as a dataset in shuffled format.

LREC as a Graph: People and Resources in a Network

Riccardo Del Gratta, Francesca Frontini, Monica Monachini, Gabriella Pardelli, Irene Russo, Roberto Bartolini, Fahad Khan, Claudia Soria and Nicoletta Calzolari

This proposal describes a new way to visualise resources in

the LREMap, a community-built repository of language resource

descriptions and uses. The LREMap is represented as a force-

directed graph, where resources, papers and authors are nodes.

The analysis of the visual representation of the underlying graph

is used to study how the community gathers around LRs and how

LRs are used in research.

The Public License Selector: Making Open Licensing Easier

Pawel Kamocki, Pavel Stranák and Michal Sedlák

Researchers in Natural Language Processing rely on availability

of data and software, ideally under open licenses, but little is

done to actively encourage it. In fact, the current Copyright

framework grants exclusive rights to authors to copy their works,

make them available to the public and make derivative works (such

as annotated language corpora). Moreover, in the EU databases

are protected against unauthorized extraction and re-utilization of

their contents. Therefore, proper public licensing plays a crucial

role in providing access to research data. A public license is a

license that grants certain rights not to one particular user, but to

the general public (everybody). Our article presents a tool that we

developed and whose purpose is to assist the user in the licensing

process. As software and data should be licensed under different

licenses, the tool is composed of two separate parts: Data and

Software. The underlying logic as well as elements of the graphic

interface are presented below.

NLP Infrastructure for the Lithuanian Language

Daiva Vitkutė-Adžgauskienė, Andrius Utka, Darius Amilevicius and Tomas Krilavicius

The Information System for Syntactic and Semantic Analysis

of the Lithuanian language (lith. Lietuvių kalbos sintaksinės ir semantinės analizės informacinė sistema, LKSSAIS) is the first

infrastructure for the Lithuanian language combining Lithuanian

language tools and resources for diverse linguistic research and

applications tasks. It provides access to the basic as well

as advanced natural language processing tools and resources,

including tools for corpus creation and management, text

preprocessing and annotation, ontology building, named entity

recognition, morphosyntactic and semantic analysis, sentiment

analysis, etc. It is an important platform for researchers and

developers in the field of natural language technology.

CodE Alltag: A German-Language E-Mail Corpus

Ulrike Krieg-Holz, Christian Schuschnig, Franz Matthies, Benjamin Redling and Udo Hahn

We introduce CODE ALLTAG, a text corpus composed of

German-language e-mails. It is divided into two partitions: the

first of these portions, CODE ALLTAG_XL, consists of a bulk-size collection drawn from an openly accessible e-mail archive (roughly 1.5M e-mails), whereas the second portion, CODE ALLTAG_S+d, is much smaller in size (less than a thousand e-mails), yet excels with demographic data from each author of an


e-mail. CODE ALLTAG, thus, currently constitutes the largest E-

Mail corpus ever built. In this paper, we describe, for both parts,

the solicitation process for gathering e-mails, present descriptive

statistical properties of the corpus, and, for CODE ALLTAG_S+d,

reveal a compilation of demographic features of the donors of e-

mails.

P33 - Morphology (2)
Thursday, May 26, 14:55

Chairperson: Felice dell’Orletta
Poster Session

Rapid Development of Morphological Analyzers for Typologically Diverse Languages

Seth Kulick and Ann Bies

The Low Resource Language research conducted under DARPA’s

Broad Operational Language Translation (BOLT) program

required the rapid creation of text corpora of typologically

diverse languages (Turkish, Hausa, and Uzbek) which were

annotated with morphological information, along with other types

of annotation. Since the output of morphological analyzers is

a significant aid to morphological annotation, we developed a

morphological analyzer for each language in order to support

the annotation task, and also as a deliverable by itself. Our

framework for analyzer creation results in tables similar to those

used in the successful SAMA analyzer for Arabic, but with a more

abstract linguistic level, from which the tables are derived. A

lexicon was developed from available resources for integration

with the analyzer, and given the speed of development and

uncertain coverage of the lexicon, we assumed that the analyzer

would necessarily be lacking in some coverage for the project

annotation. Our analyzer framework was therefore focused on

rapid implementation of the key structures of the language,

together with accepting “wildcard” solutions as possible analyses

for a word with an unknown stem, building upon our similar

experiences with morphological annotation with Modern Standard

Arabic and Egyptian Arabic.

A Neural Lemmatizer for Bengali

Abhisek Chakrabarty, Akshay Chaturvedi and UtpalGarain

We propose a novel neural lemmatization model which is

language independent and supervised in nature. To handle the

words in a neural framework, word embedding technique is

used to represent words as vectors. The proposed lemmatizer

makes use of contextual information of the surface word to be

lemmatized. Given a word along with its contextual neighbours

as input, the model is designed to produce the lemma of

the concerned word as output. We introduce a new network

architecture that permits only dimension specific connections

between the input and the output layer of the model. For the

present work, Bengali is taken as the reference language. Two

datasets are prepared for training and testing purpose consisting

of 19,159 and 2,126 instances respectively. As Bengali is a

resource scarce language, these datasets would be beneficial for

the respective research community. The evaluation shows

that the neural lemmatizer achieves 69.57% accuracy on the

test dataset and outperforms the simple cosine similarity based

baseline strategy by a margin of 1.37%.
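
The dimension-specific connections can be read as an element-wise (diagonal) transform between input and output embedding dimensions instead of a dense weight matrix. The numpy sketch below is only a schematic reconstruction under that reading, with random toy embeddings; it is not the authors' trained model, and the final cosine lookup merely mirrors the kind of similarity-based comparison used by the baseline.

    import numpy as np

    rng = np.random.default_rng(0)
    dim = 8

    # Toy embeddings for a surface word, its context, and two candidate lemmas.
    surface = rng.normal(size=dim)
    context = rng.normal(size=dim)   # e.g. an average of neighbour embeddings
    lemma_vocab = {"lemma_a": rng.normal(size=dim), "lemma_b": rng.normal(size=dim)}

    # Dimension-specific connections: output dimension i sees only input
    # dimension i, i.e. an element-wise transform instead of a dense layer.
    w = rng.normal(size=dim)
    b = np.zeros(dim)
    predicted = np.tanh(w * (surface + context) + b)

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Pick the candidate lemma whose embedding is closest to the prediction.
    best = max(lemma_vocab, key=lambda l: cosine(predicted, lemma_vocab[l]))
    print(best)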

A Finite-state Morphological Analyser for Tuvan

Francis Tyers, Aziyana Bayyr-ool, Aelita Salchak andJonathan Washington

This paper describes the development of free/open-source

finite-state morphological transducers for Tuvan, a Turkic

language spoken in and around the Tuvan Republic in Russia.

The finite-state toolkit used for the work is the Helsinki

Finite-State Toolkit (HFST), we use the lexc formalism for

modelling the morphotactics and twol formalism for modelling

morphophonological alternations. We present a novel description

of the morphological combinatorics of pseudo-derivational

morphemes in Tuvan. An evaluation is presented which shows

that the transducer has a reasonable coverage (around 93%) on freely-available corpora of the language, and high precision (over 99%) on a manually verified test set.

Tezaurs.lv: the Largest Open Lexical Database for Latvian

Andrejs Spektors, Ilze Auzina, Roberts Daris, Normunds Grūzītis, Pēteris Paikens, Lauma Pretkalnina, Laura Rituma and Baiba Saulite

We describe an extensive and versatile lexical resource for

Latvian, an under-resourced Indo-European language, which we

call Tezaurs (Latvian for ‘thesaurus’). It comprises a large

explanatory dictionary of more than 250,000 entries that are

derived from more than 280 external sources. The dictionary

is enriched with phonetic, morphological, semantic and other

annotations, as well as augmented by various language processing

tools allowing for the generation of inflectional forms and

pronunciation, for on-the-fly selection of corpus examples, for

suggesting synonyms, etc. Tezaurs is available as a public and

widely used web application for end-users, as an open data set

for the use in language technology (LT), and as an API – a set of

web services for the integration into third-party applications. The

ultimate goal of Tezaurs is to be the central computational lexicon

for Latvian, bringing together all Latvian words and frequently


used multi-word units and allowing for the integration of other LT

resources and tools.

A Finite-State Morphological Analyser for Sindhi

Raveesh Motlani, Francis Tyers and Dipti Sharma

Morphological analysis is a fundamental task in natural-language

processing, which is used in other NLP applications such as

part-of-speech tagging, syntactic parsing, information retrieval,

machine translation, etc. In this paper, we present our work on

the development of free/open-source finite-state morphological

analyser for Sindhi. We have used Apertium’s lttoolbox as our

finite-state toolkit to implement the transducer. The system is

developed using a paradigm-based approach, wherein a paradigm

defines all the word forms and their morphological features for a

given stem (lemma). We have evaluated our system on the Sindhi

Wikipedia corpus and achieved a reasonable coverage of 81% and

a precision of over 97%.
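
A paradigm in this sense pairs each suffix with the morphological features it realises for a given stem class; analysis then amounts to matching a surface form against stem plus suffix. The Python sketch below is a stripped-down, hypothetical illustration of that idea with invented English-like data, not Sindhi material or lttoolbox syntax.

    # A toy paradigm: suffix -> morphological tag string. In lttoolbox the same
    # information would be written as an XML <pardef>; here a dict suffices.
    NOUN_PARADIGM = {
        "":  "<n><sg>",
        "s": "<n><pl>",
    }

    LEXICON = {"book": NOUN_PARADIGM, "tree": NOUN_PARADIGM}

    def analyse(form):
        """Return lemma+tag analyses by matching stem plus paradigm suffix."""
        analyses = []
        for lemma, paradigm in LEXICON.items():
            for suffix, tags in paradigm.items():
                if form == lemma + suffix:
                    analyses.append(lemma + tags)
        return analyses

    print(analyse("trees"))   # ['tree<n><pl>']
    print(analyse("book"))    # ['book<n><sg>']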

Deriving Morphological Analyzers from Example Inflections

Markus Forsberg and Mans Hulden

This paper presents a semi-automatic method to derive

morphological analyzers from a limited number of example

inflections suitable for languages with alphabetic writing systems.

The system we present learns the inflectional behavior of

morphological paradigms from examples and converts the learned

paradigms into a finite-state transducer that is able to map inflected

forms of previously unseen words into lemmas and corresponding

morphosyntactic descriptions. We evaluate the system when

provided with inflection tables for several languages collected

from the Wiktionary.
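
The intuition behind generalising from example inflections can be sketched very roughly as factoring out the longest common prefix of the forms as the stem and keeping the residual endings as a reusable paradigm; the real system learns far richer patterns and compiles them into a finite-state transducer. The toy data below is invented for illustration.

    import os

    def learn_paradigm(forms):
        """Split example forms into a shared stem and a list of endings."""
        stem = os.path.commonprefix(forms)
        return stem, [f[len(stem):] for f in forms]

    def apply_paradigm(new_stem, endings):
        return [new_stem + e for e in endings]

    # Example inflection table (Swedish-like toy data).
    stem, endings = learn_paradigm(["hund", "hunden", "hundar", "hundarna"])
    print(endings)                          # ['', 'en', 'ar', 'arna']
    print(apply_paradigm("bil", endings))   # ['bil', 'bilen', 'bilar', 'bilarna']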

Morphological Analysis of Sahidic Coptic for Automatic Glossing

Daniel Smith and Mans Hulden

We report on the implementation of a morphological analyzer for

the Sahidic dialect of Coptic, a now extinct Afro-Asiatic language.

The system is developed in the finite-state paradigm. The main

purpose of the project is to provide a method by which scholars

and linguists can semi-automatically gloss extant texts written in

Sahidic. Since a complete lexicon containing all attested forms

in different manuscripts requires significant expertise in Coptic

spanning almost 1,000 years, we have equipped the analyzer with

a core lexicon and extended it with a “guesser” ability to capture

out-of-vocabulary items in any inflection. We also suggest an

ASCII transliteration for the language. A brief evaluation is

provided.

The on-line version of Grammatical Dictionary of Polish

Marcin Wolinski and Witold Kieras

We present the new online edition of a dictionary of Polish

inflection – the Grammatical Dictionary of Polish (http://

sgjp.pl). The dictionary is interesting for several reasons: it

is comprehensive (over 330,000 lexemes corresponding to almost

4,300,000 different textual words; 1116 handcrafted inflectional

patterns), the inflection is presented in an explicit manner in the

form of carefully designed tables, the user interface facilitates

advanced queries by several features (lemmas, forms, applicable

grammatical categories, types of inflection). Moreover, the data

of the dictionary is used in morphological analysers, including our

product Morfeusz (http://sgjp.pl/morfeusz). From the

start, the dictionary was meant to be comfortable for the human

reader as well as to be ready for use in NLP applications. In the

paper we briefly discuss both aspects of the resource.

P34 - Semantic Lexicons
Thursday, May 26, 14:55

Chairperson: Kiril Simov
Poster Session

Automatically Generated Affective Norms of Abstractness, Arousal, Imageability and Valence for 350 000 German Lemmas

Maximilian Köper and Sabine Schulte im Walde

This paper presents a collection of 350 000 German lemmatised

words, rated on four psycholinguistic affective attributes. All

ratings were obtained via a supervised learning algorithm that

can automatically calculate a numerical rating of a word. We

applied this algorithm to abstractness, arousal, imageability

and valence. Comparison with human ratings reveals high

correlation across all rating types. The full resource is publically

available at: http://www.ims.uni-stuttgart.de/

data/affective_norms/.
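
A generic reconstruction of such a setup, under the assumption that the learner regresses human seed ratings on word vectors and then scores unrated lemmas, is sketched below with scikit-learn's Ridge regression and invented two-dimensional toy vectors; it is not the authors' actual algorithm or feature set.

    import numpy as np
    from sklearn.linear_model import Ridge

    # Toy word vectors and human valence ratings for a handful of seed words.
    vectors = {
        "Freude": np.array([0.9, 0.1]),
        "Krieg":  np.array([-0.8, 0.3]),
        "Tisch":  np.array([0.0, 0.0]),
    }
    ratings = {"Freude": 8.5, "Krieg": 1.5, "Tisch": 5.0}

    X = np.stack([vectors[w] for w in ratings])
    y = np.array([ratings[w] for w in ratings])

    model = Ridge(alpha=1.0).fit(X, y)

    # Score an unrated lemma represented by a (toy) vector.
    print(model.predict(np.array([[0.7, 0.2]])))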

Latin Vallex. A Treebank-based Semantic Valency Lexicon for Latin

Marco Passarotti, Berta González Saavedra and Christophe Onambele

Despite a centuries-long tradition in lexicography, Latin lacks

state-of-the-art computational lexical resources. This situation is

strictly related to the still quite limited amount of linguistically


annotated textual data for Latin, which can help the building of

new lexical resources by supporting them with empirical evidence.

However, projects for creating new language resources for Latin

have been launched over the last decade to fill this gap. In this

paper, we present Latin Vallex, a valency lexicon for Latin built in

mutual connection with the semantic and pragmatic annotation of

two Latin treebanks featuring texts of different eras. On the one

hand, such a connection between the empirical evidence provided

by the treebanks and the lexicon allows us to enhance each frame

entry in the lexicon with its frequency in real data. On the other

hand, each valency-capable word in the treebanks is linked to a

frame entry in the lexicon.

A Framework for Cross-lingual/Node-wise Alignment of Lexical-Semantic Resources

Yoshihiko Hayashi

Given lexical-semantic resources in different languages, it is

useful to establish cross-lingual correspondences, preferably with

semantic relation labels, between the concept nodes in these

resources. This paper presents a framework for enabling a cross-

lingual/node-wise alignment of lexical-semantic resources, where

cross-lingual correspondence candidates are first discovered and

ranked, and then classified by a succeeding module. Indeed,

we propose that a two-tier classifier configuration is feasible

for the second module: the first classifier filters out possibly

irrelevant correspondence candidates and the second classifier

assigns a relatively fine-grained semantic relation label to each

of the surviving candidates. The results of Japanese-to-English

alignment experiments using EDR Electronic Dictionary and

Princeton WordNet are described to exemplify the validity of the

proposal.

Lexical Coverage Evaluation of Large-scale Multilingual Semantic Lexicons for Twelve Languages

Scott Piao, Paul Rayson, Dawn Archer, Francesca Bianchi, Carmen Dayrell, Mahmoud El-Haj, Ricardo-María Jiménez, Dawn Knight, Michal Kren, Laura Löfberg, Rao Muhammad Adeel Nawab, Jawad Shafi, Phoey Lee Teh and Olga Mudraya

The last two decades have seen the development of various

semantic lexical resources such as WordNet (Miller, 1995) and the

USAS semantic lexicon (Rayson et al., 2004), which have played

an important role in the areas of natural language processing

and corpus-based studies. Recently, increasing efforts have

been devoted to extending the semantic frameworks of existing

lexical knowledge resources to cover more languages, such as

EuroWordNet and Global WordNet. In this paper, we report on

the construction of large-scale multilingual semantic lexicons for

twelve languages, which employ the unified Lancaster semantic

taxonomy and provide a multilingual lexical knowledge base

for the automatic UCREL semantic annotation system (USAS).

Our work contributes towards the goal of constructing larger-

scale and higher-quality multilingual semantic lexical resources

and developing corpus annotation tools based on them. Lexical

coverage is an important factor concerning the quality of the

lexicons and the performance of the corpus annotation tools, and

in this experiment we focus on evaluating the lexical coverage

achieved by the multilingual lexicons and semantic annotation

tools based on them. Our evaluation shows that some semantic

lexicons such as those for Finnish and Italian have achieved lexical

coverage of over 90% while others need further expansion.

Building Concept Graphs from Monolingual Dictionary Entries

Gábor Recski

We present the dict_to_4lang tool for processing entries

of three monolingual dictionaries of English and mapping

definitions to concept graphs following the 4lang principles of

semantic representation introduced by (Kornai, 2010). 4lang

representations are domain- and language-independent, and make

use of only a very limited set of primitives to encode the meaning

of all utterances. Our pipeline relies on the Stanford Dependency

Parser for syntactic analysis; the dep_to_4lang module then

builds directed graphs of concepts based on dependency relations

between words in each definition. Several issues are handled by

construction-specific rules that are applied to the output of dep_to_4lang. Manual evaluation suggests that ca. 75% of graphs built from the Longman Dictionary are either entirely correct or contain only minor errors. dict_to_4lang is available under an MIT

license as part of the 4lang library and has been used successfully

in measuring Semantic Textual Similarity (Recski and Ács, 2015).

An interactive demo of core 4lang functionalities is available at

http://4lang.hlt.bme.hu.

Semantic Layer of the Valence Dictionary of Polish Walenty

Elzbieta Hajnicz, Anna Andrzejczuk and Tomasz Bartosiak

This article presents the semantic layer of Walenty—a new

valence dictionary of Polish predicates, with a number of novel

features, as compared to other such dictionaries. The dictionary

contains two layers, syntactic and semantic. The syntactic layer

describes the syntactic and morphosyntactic constraints that predicates put

on their dependants. In particular, it includes a comprehensive

and powerful phraseological component. The semantic layer

shows how predicates and their arguments are involved in

a described situation in an utterance. These two layers


are connected, representing how semantic arguments can be

realised on the surface. Each syntactic schema and each

semantic frame are illustrated by at least one exemplary sentence

attested in linguistic reality. The semantic layer consists of

semantic frames represented as lists of pairs <semantic role,

selectional preference> and connected with PlWordNet lexical

units. Semantic roles have a two-level representation (basic

roles are provided with an attribute) enabling representation of

arguments in a flexible way. Selectional preferences are based

on PlWordNet structure as well.

Italian VerbNet: A Construction-based Approach to Italian Verb Classification

Lucia Busso and Alessandro Lenci

This paper proposes a new method for Italian verb classification

-and a preliminary example of resulting classes- inspired by

Levin (1993) and VerbNet (Kipper-Schuler, 2005), yet partially

independent from these resources; we achieved such a result by

integrating Levin and VerbNet’s models of classification with

other theoretic frameworks and resources. The classification is

rooted in the constructionist framework (Goldberg, 1995; 2006)

and is distribution-based. It is also semantically characterized

by a link to FrameNet’s semantic frames to represent the event

expressed by a class. However, the new Italian classes maintain

the hierarchic “tree” structure and monotonic nature of VerbNet’s

classes, and, where possible, the original names (e.g.: Verbs of

Killing, Verbs of Putting, etc.). We therefore propose here a

taxonomy compatible with VerbNet but at the same time adapted

to Italian syntax and semantics. It also addresses a number of

problems intrinsic to the original classifications, such as the role

of argument alternations, here regarded simply as epiphenomena,

consistently with the constructionist approach.

A Large Rated Lexicon with French Medical Words

Natalia Grabar and Thierry Hamon

Patients are often exposed to medical terms, such as anosognosia,

myelodysplastic, or hepatojejunostomy, that can be semantically

complex and hardly understandable by non-experts in medicine.

Hence, it is important to assess which words are potentially non-

understandable and require further explanations. The purpose

of our work is to build a specific lexicon in which the words

are rated according to whether they are understandable or non-

understandable. We propose to work with medical words in

French such as provided by an international medical terminology.

The terms are segmented in single words and then each word

is manually processed by three annotators. The objective is

to assign each word into one of the three categories: I can

understand, I am not sure, I cannot understand. The annotators

do not have medical training nor do they present specific medical

problems. They are supposed to represent an average patient.

The inter-annotator agreement is then computed. The content

of the categories is analyzed. Possible applications in which

this lexicon can be helpful are proposed and discussed. The

rated lexicon is freely available for the research purposes. It

is accessible online at http://natalia.grabar.perso.

sfr.fr/rated-lexicon.html.

Best of Both Worlds: Making Word Sense Embeddings Interpretable

Alexander Panchenko

Word sense embeddings represent a word sense as a low-

dimensional numeric vector. While this representation is

potentially useful for NLP applications, its interpretability is

inherently limited. We propose a simple technique that improves

interpretability of sense vectors by mapping them to synsets

of a lexical resource. Our experiments with AdaGram sense

embeddings and BabelNet synsets show that it is possible to

retrieve synsets that correspond to automatically learned sense

vectors with Precision of 0.87, Recall of 0.42 and AUC of 0.78.
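
A simplified sketch of the linking step, under the assumption that a learned sense vector and each candidate synset can be placed in the same vector space (for instance by averaging vectors of the synset's lemmas or gloss words): the candidate with the highest cosine similarity above a threshold is kept. Vectors and threshold are invented; the experiments in the paper use AdaGram vectors and BabelNet synsets.

    import numpy as np

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # A learned sense vector and toy synset vectors in the same space.
    sense_vec = np.array([0.8, 0.1, 0.1])
    synset_vecs = {
        "bank.n.01 (financial institution)": np.array([0.7, 0.2, 0.0]),
        "bank.n.02 (river side)":            np.array([0.0, 0.3, 0.9]),
    }

    threshold = 0.5   # invented cut-off; tuned on held-out data in practice
    scores = {s: cosine(sense_vec, v) for s, v in synset_vecs.items()}
    best, score = max(scores.items(), key=lambda kv: kv[1])
    print((best, round(score, 3)) if score >= threshold else "no link")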

VerbLexPor: a lexical resource with semantic roles for Portuguese

Leonardo Zilio, Maria José Bocorny Finatto and Aline Villavicencio

This paper presents a lexical resource developed for Portuguese.

The resource contains sentences annotated with semantic roles.

The sentences were extracted from two domains: Cardiology

research papers and newspaper articles. Both corpora were

analyzed with the PALAVRAS parser and subsequently processed

with a subcategorization frames extractor, so that each sentence

that contained at least one main verb was stored in a database

together with its syntactic organization. The annotation was

manually carried out by a linguist using an annotation interface.

Both the annotated and non-annotated data were exported to an

XML format, which is readily available for download. The reason

behind exporting non-annotated data is that there is syntactic

information collected from the parser annotation in the non-

annotated data, and this could be useful for other researchers. The

sentences from both corpora were annotated separately, so that

it is possible to access sentences either from the Cardiology or

from the newspaper corpus. The full resource presents more than

seven thousand semantically annotated sentences, containing 192


different verbs and more than 15 thousand individual arguments

and adjuncts.

A Multilingual Predicate Matrix

Maddalen Lopez de Lacalle, Egoitz Laparra, Itziar Aldabe and German Rigau

This paper presents the Predicate Matrix 1.3, a lexical

resource resulting from the integration of multiple sources of

predicate information including FrameNet, VerbNet, PropBank

and WordNet. This new version of the Predicate Matrix has

been extended to cover nominal predicates by adding mappings

to NomBank. Similarly, we have integrated resources in Spanish,

Catalan and Basque. As a result, the Predicate Matrix 1.3 provides

a multilingual lexicon to allow interoperable semantic analysis in

multiple languages.

A Gold Standard for Scalar Adjectives

Bryan Wilkinson and Tim Oates

We present a gold standard for evaluating scale membership

and the order of scalar adjectives. In addition to evaluating

existing methods of ordering adjectives, this knowledge will

aid in studying the organization of adjectives in the lexicon.

This resource is the result of two elicitation tasks conducted

with informants from Amazon Mechanical Turk. The first

task is notable for gathering open-ended lexical data from

informants. The data is analyzed using Cultural Consensus

Theory, a framework from anthropology, to not only determine

scale membership but also the level of consensus among the

informants (Romney et al., 1986). The second task gathers a

culturally salient ordering of the words determined to be members.

We use this method to produce 12 scales of adjectives for use in

evaluation.

VerbCROcean: A Repository of Fine-Grained Semantic Verb Relations for Croatian

Ivan Sekulic and Jan Šnajder

In this paper we describe VerbCROcean, a broad-coverage

repository of fine-grained semantic relations between Croatian

verbs. Adopting the methodology of Chklovski and Pantel

(2004) used for acquiring the English VerbOcean, we first acquire

semantically related verb pairs from a web corpus hrWaC by

relying on distributional similarity of subject-verb-object paths in

the dependency trees. We then classify the semantic relations

between each pair of verbs as similarity, intensity, antonymy, or

happens-before, using a number of manually-constructed lexico-

syntactic patterns. We evaluate the quality of the resulting resource

on a manually annotated sample of 1000 semantic verb relations.

The evaluation revealed that the predictions are most accurate for

the similarity relation, and least accurate for the intensity relation.

We make available two variants of VerbCROcean: a coverage-

oriented version, containing about 36k verb pairs at a precision of

41%, and a precision-oriented version containing about 5k verb

pairs, at a precision of 56%.

Enriching a Portuguese WordNet using Synonyms from a Monolingual Dictionary

Alberto Simões, Xavier Gómez Guinovart and José João Almeida

In this article we present an exploratory approach to enrich a

WordNet-like lexical ontology with the synonyms present in a

standard monolingual Portuguese dictionary. The dictionary was

converted from PDF into XML and senses were automatically

identified and annotated. This allowed us to extract them,

independently of definitions, and to create sets of synonyms

(synsets). These synsets were then aligned with WordNet

synsets, both in the same language (Portuguese) and projecting

the Portuguese terms into English, Spanish and Galician. This

process allowed both the addition of new term variants to existing

synsets and the creation of new synsets for Portuguese.

O25 - Sentiment Analysis, Thursday, May 26, 14:55

Chairperson: Frédérique Segond Oral Session

Reliable Baselines for Sentiment Analysis in Resource-Limited Languages: The Serbian Movie Review Dataset

Vuk Batanovic, Boško Nikolic and Milan Milosavljevic

Collecting data for sentiment analysis in resource-limited

languages carries a significant risk of sample selection bias,

since the small quantities of available data are most likely not

representative of the whole population. Ignoring this bias leads to

less robust machine learning classifiers and less reliable evaluation

results. In this paper we present a dataset balancing algorithm

that minimizes the sample selection bias by eliminating irrelevant

systematic differences between the sentiment classes. We prove

its superiority over the random sampling method and we use

it to create the Serbian movie review dataset – SerbMR – the

first balanced and topically uniform sentiment analysis dataset in

Serbian. In addition, we propose an incremental way of finding

the optimal combination of simple text processing options and

machine learning features for sentiment classification. Several

popular classifiers are used in conjunction with this evaluation


approach in order to establish strong but reliable baselines for

sentiment analysis in Serbian.

ANTUSD: A Large Chinese Sentiment Dictionary

Shih-Ming Wang and Lun-Wei Ku

This paper introduces the augmented NTU sentiment dictionary,

abbreviated as ANTUSD, which is constructed by collecting

sentiment statistics of words from several sentiment annotation projects.

A total of 26,021 words were collected in ANTUSD. For each

word, the CopeOpi numerical sentiment score and the numbers of

positive, neutral, negative, non-opinionated, and not-a-word

annotations are provided.

Words and their sentiment information in ANTUSD have been

linked to the Chinese ontology E-HowNet to provide rich

semantic information. We demonstrate the usage of ANTUSD

in polarity classification of words, and the results show that a

superior F-score of 98.21 is achieved, which supports the usefulness

of the ANTUSD. ANTUSD can be freely obtained through

application from NLPSA lab, Academia Sinica: http://

academiasinicanlplab.github.io/
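
As an informal illustration of the per-word information described above, the following Python sketch shows one possible (hypothetical) in-memory representation of an ANTUSD-style entry and a naive polarity lookup based on the CopeOpi score; it is not the official distribution format.

```python
# Illustrative sketch (not the official ANTUSD format): one dictionary entry
# per word, mirroring the fields listed in the abstract.
antusd_entry = {
    "word": "高興",           # "happy"
    "copeopi_score": 0.7,     # hypothetical CopeOpi sentiment score
    "positive": 5,            # counts of annotation labels
    "neutral": 0,
    "negative": 0,
    "non_opinionated": 0,
    "not_a_word": 0,
}

def polarity(entry, threshold=0.0):
    """Classify a word as positive/negative/neutral from its CopeOpi score."""
    score = entry["copeopi_score"]
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

print(polarity(antusd_entry))   # -> "positive"
```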

Aspect based Sentiment Analysis in Hindi: Resource Creation and Evaluation

Md Shad Akhtar, Asif Ekbal and Pushpak Bhattacharyya

Due to the phenomenal growth of online product reviews,

sentiment analysis (SA) has gained huge attention, for example,

by online service providers. A number of benchmark datasets for

a wide range of domains have been made available for sentiment

analysis, especially in resource-rich languages. In this paper we

assess the challenges of SA in Hindi by providing a benchmark

setup, where we create an annotated dataset of high quality, build

machine learning models for sentiment analysis in order to show

the effective usage of the dataset, and finally make the resource

available to the community for further advancement of research.

The dataset comprises Hindi product reviews crawled from

various online sources. Each sentence of the review is annotated

with aspect terms and their associated sentiment. As classification

algorithms we use Conditional Random Field (CRF) and Support

Vector Machine (SVM) for aspect term extraction and sentiment

analysis, respectively. Evaluation results show an average F-

measure of 41.07% for aspect term extraction and an accuracy of

54.05% for sentiment classification.

Gulf Arabic Linguistic Resource Building for Sentiment Analysis

Wafia Adouane and Richard Johansson

This paper deals with building linguistic resources for Gulf

Arabic, one of the Arabic varieties, for the sentiment analysis task

using machine learning. To our knowledge, no previous work

has been done on Gulf Arabic sentiment analysis despite the fact

that it is present in different online platforms. Hence, the first

challenge is the absence of annotated data and sentiment lexicons.

To fill this gap, we created these two main linguistic resources.

Then we conducted different experiments: using a Naive Bayes

classifier without any lexicon; adding a sentiment lexicon designed

primarily for MSA; using only the compiled Gulf Arabic sentiment

lexicon; and finally using both MSA and Gulf Arabic sentiment

lexicons. The Gulf Arabic lexicon gives a good improvement

of the classifier accuracy (90.54%) over a baseline that does

not use the lexicon (82.81%), while the MSA lexicon causes

the accuracy to drop to 76.83%. Moreover, mixing MSA and

Gulf Arabic lexicons causes the accuracy to drop to 84.94%

compared to using only the Gulf Arabic lexicon. This indicates that

it is useless to use MSA resources to deal with Gulf Arabic due

to the considerable differences and conflicting structures between

these two languages.
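
For readers who want a concrete picture of the experimental setup sketched above, the following is a minimal, hedged Python sketch of a Naive Bayes polarity classifier whose bag-of-words features are augmented with lexicon-hit counts; the texts, labels and lexicon entries are toy placeholders, and the code is not the authors' implementation.

```python
# Minimal sketch (toy data and toy lexicon, not the authors' code):
# Naive Bayes over bag-of-words features plus sentiment-lexicon hit counts.
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["الخدمة زين", "الاكل خايس"]      # toy Gulf Arabic posts
train_labels = ["pos", "neg"]                    # toy gold polarity labels
gulf_lexicon = {"زين": 1, "خايس": -1}            # toy sentiment lexicon

vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(train_texts)

def lexicon_counts(text):
    # Count positive and negative lexicon hits in the text.
    tokens = text.split()
    pos = sum(1 for t in tokens if gulf_lexicon.get(t, 0) > 0)
    neg = sum(1 for t in tokens if gulf_lexicon.get(t, 0) < 0)
    return [pos, neg]

X_lex = csr_matrix([lexicon_counts(t) for t in train_texts])
X = hstack([X_bow, X_lex])                       # bag-of-words + lexicon features

clf = MultinomialNB().fit(X, train_labels)
test = "الاكل زين"
X_test = hstack([vectorizer.transform([test]), csr_matrix([lexicon_counts(test)])])
print(clf.predict(X_test))
```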

Using Data Mining Techniques for Sentiment Shifter Identification

Samira Noferesti and Mehrnoush Shamsfard

Sentiment shifters, i.e., words and expressions that can affect text

polarity, play an important role in opinion mining. However, the

limited ability of current automated opinion mining systems to

handle shifters represents a major challenge. The majority of

existing approaches rely on a manual list of shifters; few attempts

have been made to automatically identify shifters in text. Most of

them just focus on negating shifters. This paper presents a novel

and efficient semi-automatic method for identifying sentiment

shifters in drug reviews, aiming at improving the overall accuracy

of opinion mining systems. To this end, we use weighted

association rule mining (WARM), a well-known data mining

technique, for finding frequent dependency patterns representing

sentiment shifters from a domain-specific corpus. These patterns

that include different kinds of shifter words such as shifter verbs

and quantifiers are able to handle both local and long-distance

shifters. We also combine these patterns with a lexicon-based

approach for the polarity classification task. Experiments on

drug reviews demonstrate that extracted shifters can improve the

precision of the lexicon-based approach for polarity classification

by 9.25 percent.
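
The following Python fragment is a minimal sketch of the weighted-support computation at the heart of weighted association rule mining, applied to candidate dependency patterns; the pattern strings, weights and threshold are illustrative assumptions, not the paper's actual data or parameters.

```python
# Minimal sketch (toy patterns and weights, not the paper's data or code):
# weighted support of candidate dependency patterns, the core quantity used
# in weighted association rule mining (WARM) to find shifter candidates.
from collections import defaultdict

# Each item: (set of dependency patterns extracted from a review, weight of
# that review, e.g. how strongly its polarity deviates from a lexicon guess).
reviews = [
    ({"neg(effective)", "quant(little, effect)"}, 0.9),
    ({"neg(effective)"}, 0.6),
    ({"amod(good, drug)"}, 0.2),
]

def weighted_support(pattern, reviews):
    total = sum(w for _, w in reviews)
    hit = sum(w for patterns, w in reviews if pattern in patterns)
    return hit / total if total else 0.0

support = defaultdict(float)
for patterns, _ in reviews:
    for p in patterns:
        support[p] = weighted_support(p, reviews)

# Keep patterns whose weighted support clears a (toy) threshold.
shifter_candidates = [p for p, s in support.items() if s >= 0.5]
print(shifter_candidates)
```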


O26 - Discourse and Dialogue, Thursday, May 26, 14:55

Chairperson: Mark Liberman Oral Session

Discourse Structure and Dialogue Acts in Multiparty Dialogue: the STAC Corpus

Nicholas Asher, Julie Hunter, Mathieu Morey, Farah Benamara and Stergos Afantenos

This paper describes the STAC resource, a corpus of multi-party

chats annotated for discourse structure in the style of SDRT (Asher

and Lascarides, 2003; Lascarides and Asher, 2009). The main

goal of the STAC project is to study the discourse structure

of multi-party dialogues in order to understand the linguistic

strategies adopted by interlocutors to achieve their conversational

goals, especially when these goals are opposed. The STAC corpus

is not only a rich source of data on strategic conversation, but also

the first corpus that we are aware of that provides full discourse

structures for multi-party dialogues. It has other remarkable

features that make it an interesting resource for other topics:

interleaved threads, creative language, and interactions between

linguistic and extra-linguistic contexts.

Purely Corpus-based Automatic Conversation Authoring

Guillaume Dubuisson Duplessis, Vincent Letard, Anne-Laure Ligozat and Sophie Rosset

This paper presents an automatic corpus-based process to author

an open-domain conversational strategy usable both in chatterbot

systems and as a fallback strategy for out-of-domain human

utterances. Our approach is implemented on a corpus of television

drama subtitles. This system is used as a chatterbot system to

collect a corpus of 41 open-domain textual dialogues with 27

human participants. The general capabilities of the system are

studied through objective measures and subjective self-reports in

terms of understandability, repetition and coherence of the system

responses selected in reaction to human utterances. Subjective

evaluations of the collected dialogues are presented with respect

to amusement, engagement and enjoyability. The main factors

influencing those dimensions in our chatterbot experiment are

discussed.

Dialogue System Characterisation by Back-channelling Patterns Extracted from Dialogue Corpus

Masashi Inoue and Hiroshi Ueno

In this study, we describe the use of back-channelling patterns

extracted from a dialogue corpus as a means of characterising text-

based dialogue systems. Our goal was to provide system users

with the feeling that they are interacting with distinct individuals

rather than artificially created characters. An analysis of the

corpus revealed that substantial differences exist among speakers

regarding the usage patterns of back-channelling. The patterns

consist of back-channelling frequency, types, and expressions.

They were used for system characterisation. Implemented system

characters were tested by asking users of the dialogue system to

identify the source speakers in the corpus. Experimental results

suggest the possibility of using back-channelling patterns alone

to characterise the dialogue system in some cases, even among the

same age and gender groups.

Towards Automatic Identification of Effective Clues for Team Word-Guessing Games

Eli Pincus and David Traum

Team word-guessing games where one player, the clue-giver,

gives clues attempting to elicit a target-word from another player,

the receiver, are a popular form of entertainment and also used for

educational purposes. Creating an engaging computational agent

capable of emulating a talented human clue-giver in a timed word-

guessing game depends on the ability to provide effective clues

(clues able to elicit a correct guess from a human receiver). There

are many available web resources and databases that can be mined

for the raw material for clues for target-words; however, a large

number of those clues are unlikely to be able to elicit a correct

guess from a human guesser. In this paper, we propose a method

for automatically filtering a clue corpus for effective clues for an

arbitrary target-word from a larger set of potential clues, using

machine learning on a set of features of the clues, including point-

wise mutual information between a clue’s constituent words and

a clue’s target-word. The results of the experiments significantly

improve the average clue quality over previous approaches, and

bring quality rates in-line with measures of human clue quality

derived from a corpus of human-human interactions. The paper

also introduces the data used to develop this method; audio

recordings of people making guesses after having heard the clues

being spoken by a synthesized voice.
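
As a small illustration of one of the features mentioned above, the sketch below computes pointwise mutual information between candidate clue words and a target word from hypothetical co-occurrence counts; the counts and the target word are invented for the example.

```python
# Illustrative sketch (toy counts, not the authors' feature extractor):
# PMI between a clue's constituent words and the clue's target word.
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information: log2( p(x, y) / (p(x) * p(y)) )."""
    if count_xy == 0:
        return float("-inf")
    p_xy = count_xy / total
    return math.log2(p_xy / ((count_x / total) * (count_y / total)))

# Hypothetical background-corpus counts for the target word "piano":
# word -> (co-occurrences with "piano", total occurrences of the word).
total_windows = 1_000_000
count_target = 1_200
cooc = {"keys": (300, 5_000), "instrument": (400, 20_000), "the": (900, 600_000)}

clue_scores = {w: pmi(c_xy, c_x, count_target, total_windows)
               for w, (c_xy, c_x) in cooc.items()}
print(clue_scores)   # content words score far higher than function words
```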

Automatic Construction of Discourse Corpora for Dialogue Translation

Longyue Wang, Xiaojun Zhang, Zhaopeng Tu, Andy Way and Qun Liu

In this paper, a novel approach is proposed to automatically

construct a parallel discourse corpus for dialogue machine

translation. Firstly, the parallel subtitle data and its corresponding

monolingual movie script data are crawled and collected from

the Internet. Then tags such as speaker and discourse boundary from

the script data are projected to its subtitle data via an information

retrieval approach in order to map monolingual discourse to


bilingual texts. We not only evaluate the mapping results, but also

integrate speaker information into the translation. Experiments

show our proposed method can achieve 81.79% and 98.64%

accuracy on speaker and dialogue boundary annotation, and

speaker-based language model adaptation can obtain around 0.5

BLEU points improvement in translation quality. Finally, we

publicly release around 100K parallel discourse data with manual

speaker and dialogue boundary annotation.

O27 - Machine Translation and Evaluation (2), Thursday, May 26, 14:55

Chairperson: Nizar Habash Oral Session

Using Contextual Information for Machine Translation Evaluation

Marina Fomicheva and Núria Bel

Automatic evaluation of Machine Translation (MT) is typically

approached by measuring similarity between the candidate MT

and a human reference translation. An important limitation of

existing evaluation systems is that they are unable to distinguish

candidate-reference differences that arise due to acceptable

linguistic variation from the differences induced by MT errors.

In this paper we present a new metric, UPF-Cobalt, that addresses

this issue by taking into consideration the syntactic contexts of

candidate and reference words. The metric applies a penalty when

the words are similar but the contexts in which they occur are

not equivalent. In this way, Machine Translations (MTs) that are

different from the human translation but still essentially correct are

distinguished from those that share a high number of words with the

reference but alter the meaning of the sentence due to translation

errors. The results show that the method proposed is indeed

beneficial for automatic MT evaluation. We report experiments

based on two different evaluation tasks with various types of

manual quality assessment. The metric significantly outperforms

state-of-the-art evaluation systems in varying evaluation settings.
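
The following toy function illustrates the general idea of penalising lexically similar words whose syntactic contexts differ; it is a schematic simplification and not the actual UPF-Cobalt scoring.

```python
# Schematic toy illustration of context-penalised word matching
# (not the actual UPF-Cobalt scoring function).
def penalised_word_score(lexical_sim, context_sim, penalty_weight=0.5):
    """Down-weight a candidate-reference word match when the syntactic
    contexts in which the two words occur are not equivalent."""
    return lexical_sim - penalty_weight * (1.0 - context_sim)

print(penalised_word_score(1.0, 1.0))   # identical word, matching context -> 1.0
print(penalised_word_score(1.0, 0.2))   # identical word, diverging context -> 0.6
```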

Bootstrapping a Hybrid MT System to a New Language Pair

João António Rodrigues, Nuno Rendeiro, Andreia Querido, Sanja Štajner and António Branco

The usual concern when opting for a rule-based or a hybrid

machine translation (MT) system is how much effort is required to

adapt the system to a different language pair or a new domain. In

this paper, we describe a way of adapting an existing hybrid MT

system to a new language pair, and show that such a system can

outperform a standard phrase-based statistical machine translation

system with an average of 10 persons/month of work. This is

specifically important in the case of domain-specific MT for which

there is not enough parallel data for training a statistical machine

translation system.

Filtering Wiktionary Triangles by Linear Mapping between Distributed Word Models

Márton Makrai

Word translations arise in dictionary-like organization as well as

via machine learning from corpora. The former is exemplified

by Wiktionary, a crowd-sourced dictionary with editions in many

languages. Ács et al. (2013) obtain word translations from

Wiktionary with the pivot-based method, also called triangulation,

that infers word translations in a pair of languages based on

translations to other, typically better resourced ones called pivots.

Triangulation may introduce noise if words in the pivot are

polysemous. The reliability of each triangulated translation is

basically estimated by the number of pivot languages (Tanaka et

al. 1994). Mikolov et al. (2013) introduce a method for generating

or scoring word translations. Translation is formalized as a linear

mapping between distributed vector space models (VSM) of the

two languages. VSMs are trained on monolingual data, while

the mapping is learned in a supervised fashion, using a seed

dictionary of some thousand word pairs. The mapping can be

used to associate existing translations with a real-valued similarity

score. This paper exploits human labor in Wiktionary combined

with distributional information in VSMs. We train VSMs on

gigaword corpora, and the linear translation mapping on direct

(non-triangulated) Wiktionary pairs. This mapping is used to

filter triangulated translations based on scores. The motivation

is that scores by the mapping may be a smoother measure of

merit than considering only the number of pivots for the triangle.

We evaluate the scores against dictionaries extracted from parallel

corpora (Tiedemann 2012). We show that linear translation really

provides a more reliable method for triangle scoring than pivot

count. The methods we use are language-independent, and the

training data is easy to obtain for many languages. We chose

the German-Hungarian pair for evaluation, in which the filtered

triangles resulting from our experiments constitute the largest freely

available list of word translations we are aware of.
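
A minimal sketch of the Mikolov-style scoring step described above is given below, assuming numpy and randomly generated toy vectors in place of real monolingual embeddings and Wiktionary seed pairs: a linear map is fitted by least squares and triangulated candidates are scored by cosine similarity after mapping.

```python
# Minimal sketch (random toy vectors in place of real embeddings and
# Wiktionary seed pairs): fit a linear map by least squares and score a
# candidate translation pair by cosine similarity after mapping.
import numpy as np

d_src, d_tgt, n_pairs = 100, 100, 5000
rng = np.random.default_rng(0)
X = rng.normal(size=(n_pairs, d_src))   # source-side vectors of seed pairs
Y = rng.normal(size=(n_pairs, d_tgt))   # target-side vectors of seed pairs

# W minimises ||XW - Y||^2 (ordinary least squares).
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def translation_score(src_vec, tgt_vec):
    """Cosine similarity between the mapped source vector and the target vector."""
    mapped = src_vec @ W
    return float(mapped @ tgt_vec /
                 (np.linalg.norm(mapped) * np.linalg.norm(tgt_vec)))

# A triangulated candidate pair would be kept if its score clears a threshold.
print(translation_score(rng.normal(size=d_src), rng.normal(size=d_tgt)))
```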

Translation Errors and Incomprehensibility: a Case Study using Machine-Translated Second Language Proficiency Tests

Takuya Matsuzaki, Akira Fujita, Naoya Todo and Noriko H. Arai

This paper reports on an experiment where 795 human

participants answered questions taken from second language

proficiency tests that were translated into their native language. The


output of three machine translation systems and two different

human translations were used as the test material. We classified

the translation errors in the questions according to an error

taxonomy and analyzed the participants’ response on the basis

of the type and frequency of the translation errors. Through the

analysis, we identified several types of errors that most degraded

the accuracy of the participants’ answers, their confidence in

the answers, and their overall evaluation of the translation quality.

Word Sense-Aware Machine Translation: Including Senses as Contextual Features for Improved Translation Models

Steven Neale, Luís Gomes, Eneko Agirre, Oier Lopez de Lacalle and António Branco

Although it is commonly assumed that word sense disambiguation

(WSD) should help to improve lexical choice and improve the

quality of machine translation systems, how to successfully

integrate word senses into such systems remains an unanswered

question. Some successful approaches have involved

reformulating either WSD or the word senses it produces, but

work on using traditional word senses to improve machine

translation has met with limited success. In this paper,

we build upon previous work that experimented on including

word senses as contextual features in maxent-based translation

models. Training on a large, open-domain corpus (Europarl), we

demonstrate that this approach yields significant improvements in

machine translation from English to Portuguese.

O28 - Corpus Querying and Crawling, Thursday, May 26, 14:55

Chairperson: Tomaž Erjavec Oral Session

SuperCAT: The (New and Improved) Corpus Analysis Toolkit

K. Bretonnel Cohen, William A. Baumgartner Jr. and Irina Temnikova

This paper reports SuperCAT, a corpus analysis toolkit. It is a

radical extension of SubCAT, the Sublanguage Corpus Analysis

Toolkit, from sublanguage analysis to corpus analysis in general.

The idea behind SuperCAT is that representative corpora have

no tendency towards closure—that is, they tend towards infinity.

In contrast, non-representative corpora have a tendency towards

closure—roughly, finiteness. SuperCAT focuses on general

techniques for the quantitative description of the characteristics of

any corpus (or other language sample), particularly concerning the

characteristics of lexical distributions. Additionally, SuperCAT

features a complete re-engineering of the previous SubCAT

architecture.

LanguageCrawl: A Generic Tool for Building Language Models Upon Common-Crawl

Szymon Roziewski and Wojciech Stokowiec

The web contains an immense amount of data; hundreds

of billions of words are waiting to be extracted and used for

language research. In this work we introduce our tool

LanguageCrawl which allows NLP researchers to easily construct

web-scale corpora from the Common Crawl Archive: a petabyte-

scale, open repository of web crawl information. Three use-

cases are presented: filtering Polish websites, building N-gram

corpora and training a continuous skip-gram language model with

hierarchical softmax. Each of them has been implemented within

the LanguageCrawl toolkit, with the possibility to adjust specified

language and N-gram ranks. Special effort has been put into high

computing efficiency, by applying highly concurrent multitasking.

We make our tool publicly available to enrich NLP resources.

We strongly believe that our work will help to facilitate NLP

research, especially in under-resourced languages, where the lack

of appropriately sized corpora is a serious hindrance to applying

data-intensive methods, such as deep neural networks.
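
As a rough illustration of the third use case, the sketch below trains a continuous skip-gram model with hierarchical softmax on a couple of toy tokenised sentences, using the gensim library as a stand-in; LanguageCrawl's own implementation and parameters may differ.

```python
# Rough sketch using gensim as a stand-in (toy sentences; the toolkit's own
# implementation and hyper-parameters may differ): a continuous skip-gram
# model with hierarchical softmax.
from gensim.models import Word2Vec

sentences = [
    ["to", "jest", "przykładowe", "zdanie"],   # toy tokenised Polish sentences
    ["drugie", "przykładowe", "zdanie"],
]

model = Word2Vec(
    sentences,
    vector_size=100,    # embedding dimensionality
    sg=1,               # skip-gram rather than CBOW
    hs=1, negative=0,   # hierarchical softmax instead of negative sampling
    min_count=1,
)
print(model.wv.most_similar("zdanie", topn=3))
```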

Features for Generic Corpus Querying

Thomas Eckart, Christoph Kuras and Uwe Quasthoff

The availability of large corpora for more and more languages

enforces generic querying and standard interfaces. This

development is especially relevant in the context of integrated

research environments like CLARIN or DARIAH. The paper

focuses on several applications and implementation details on

the basis of a unified corpus format, a unique POS tag set,

and prepared data for word similarities. All described data or

applications are already or will be in the near future accessible

via well-documented RESTful Web services. The target group are

all kinds of interested persons with varying levels of experience in

programming or corpus query languages.

European Union Language Resources in Sketch Engine

Vít Baisa, Jan Michelfeit, Marek Medved’ and Milos Jakubicek

Several parallel corpora built from European Union language

resources are presented here. They were processed by state-of-the-

art tools and made available for researchers in the corpus manager

Sketch Engine. A completely new resource is introduced: EUR-

Lex Corpus, one of the largest parallel corpora available at

the moment, containing 840 million English tokens; its largest

language pair, English-French, has more than 25 million aligned

segments (paragraphs).

Corpus Query Lingua Franca (CQLF)

Piotr Banski, Elena Frick and Andreas Witt

The present paper describes Corpus Query Lingua Franca (ISO

CQLF), a specification designed at ISO Technical Committee

37 Subcommittee 4 “Language resource management” for the

purpose of facilitating the comparison of properties of corpus

query languages. We overview the motivation for this endeavour

and present its aims and its general architecture. CQLF is intended

as a multi-part specification; here, we concentrate on the basic

metamodel that provides a frame into which the other parts fit.

P35 - Grammar and Syntax, Thursday, May 26, 16:55

Chairperson: Maria Simi Poster Session

A sense-based lexicon of count and mass expressions: The Bochum English Countability Lexicon

Tibor Kiss, Francis Jeffry Pelletier, Halima Husic, Roman Nino Simunic and Johanna Marie Poppek

The present paper describes the current release of the Bochum

English Countability Lexicon (BECL 2.1), a large empirical

database consisting of lemmata from Open ANC (http://

www.anc.org) with added senses from WordNet (Fellbaum

1998). BECL 2.1 contains 11,800 annotated noun-sense

pairs, divided into four major countability classes and 18 fine-

grained subclasses. In the current version, BECL also provides

information on nouns whose senses occur in more than one

class, allowing a closer look at polysemy and homonymy with

regard to countability. Further included are sets of similar

senses using the Leacock and Chodorow (LCH) score for

semantic similarity (Leacock & Chodorow 1998), information

on orthographic variation, on the completeness of all WordNet

senses in the database and an annotated representation of different

types of proper names. The further development of BECL will

investigate the different countability classes of proper names and

the general relation between semantic similarity and countability

as well as recurring syntactic patterns for noun-sense pairs. The

BECL 2.1 database is also publicly available via http://

count-and-mass.org.
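
For readers unfamiliar with the Leacock-Chodorow score mentioned above, the following snippet shows how it can be computed for two WordNet noun senses via NLTK (the WordNet data package must be installed); the example words are arbitrary and unrelated to the BECL build process.

```python
# Requires NLTK with the WordNet data package installed; example words are
# arbitrary and unrelated to BECL itself.
from nltk.corpus import wordnet as wn

beer = wn.synsets("beer", pos=wn.NOUN)[0]
wine = wn.synsets("wine", pos=wn.NOUN)[0]
water = wn.synsets("water", pos=wn.NOUN)[0]

# LCH similarity: -log(shortest_path_length / (2 * taxonomy_depth)),
# defined for two senses of the same part of speech.
print(beer.lch_similarity(wine))
print(beer.lch_similarity(water))
```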

Detecting Optional Arguments of Verbs

Andras Kornai, Dávid Márk Nemeskey and Gábor Recski

We propose a novel method for detecting optional arguments of

Hungarian verbs using only positive data. We introduce a custom

variant of collexeme analysis that explicitly models the noise in

verb frames. Our method is, for the most part, unsupervised:

we use the spectral clustering algorithm described in Brew and

Schulte in Walde (2002) to build a noise model from a short,

manually verified seed list of verbs. We experimented with

both raw count- and context-based clusterings and found their

performance almost identical. The code for our algorithm and the

frame list are freely available at http://hlt.bme.hu/en/

resources/tade.

Leveraging Native Data to Correct Preposition Errors in Learners’ Dutch

Lennart Kloppenburg and Malvina Nissim

We address the task of automatically correcting preposition errors

in learners’ Dutch by modelling preposition usage in native

language. Specifically, we build two models exploiting a large

corpus of Dutch. The first is a binary model for detecting whether

a preposition should be used at all in a given position or not.

The second is a multiclass model for selecting the appropriate

preposition in case one should be used. The models are tested

on native as well as learners’ data. For the latter we exploit a

crowdsourcing strategy to elicit native judgements. On native test

data the models perform very well, showing that we can model

preposition usage appropriately. However, the evaluation on

learners’ data shows that while detecting that a given preposition

is wrong is doable reasonably well, detecting the absence of a

preposition is a lot more difficult. Observing such results and

the data we deal with, we envisage various ways of improving

performance, and report them in the final section of this article.

D(H)ante: A New Set of Tools for XIII Century Italian

Angelo Basile and Federico Sangati

In this paper we describe 1) the process of converting a corpus of

Dante Alighieri from a TEI XML format into a pseudo-CoNLL

format; 2) how a pos-tagger trained on modern Italian performs on

Dante’s Italian; and 3) the performance of two different pos-taggers

trained on the given corpus. We are making our conversion

scripts and models available to the community. The two other

models trained on the corpus perform reasonably well. The tool

used for the conversion process might prove useful for bridging

the gap between traditional digital humanities and modern NLP

applications since the TEI original format is not usually suitable

for being processed with standard NLP tools. We believe our work

will serve both communities: the DH community will be able to

tag new documents and the NLP world will have an easier way of


converting existing documents to a standardized machine-readable

format.

Multilevel Annotation of Agreement and Disagreement in Italian News Blogs

Fabio Celli, Giuseppe Riccardi and Firoj Alam

In this paper, we present a corpus of news blog conversations

in Italian annotated with gold standard agreement/disagreement

relations at message and sentence levels. This is the first resource

of this kind in Italian. From the analysis of ADRs at the two

levels it emerged that agreement annotated at the message level is

consistent and generally reflected at the sentence level; moreover, the

argumentation structure of disagreement is more complex than that of

agreement. The manual error analysis revealed that this resource

is useful not only for the analysis of argumentation, but also for

the detection of irony/sarcasm in online debates. The corpus and

annotation tool are available for research purposes on request.

Sentence Similarity based on Dependency Tree Kernels for Multi-document Summarization

Saziye Betül Özates, Arzucan Özgür and Dragomir Radev

We introduce an approach based on using the dependency

grammar representations of sentences to compute sentence

similarity for extractive multi-document summarization. We

adapt and investigate the effects of two untyped dependency

tree kernels, which have originally been proposed for relation

extraction, to the multi-document summarization problem. In

addition, we propose a series of novel dependency grammar based

kernels to better represent the syntactic and semantic similarities

among the sentences. The proposed methods incorporate the type

information of the dependency relations for sentence similarity

calculation. To our knowledge, this is the first study that

investigates using dependency tree based sentence similarity for

multi-document summarization.

Discontinuous Verb Phrases in Parsing and Machine Translation of English and German

Sharid Loáiciga and Kristina Gulordava

In this paper, we focus on the verb-particle (V-Prt) split

construction in English and German and its difficulty for parsing

and Machine Translation (MT). For German, we use an existing

test suite of V-Prt split constructions, while for English, we

build a new and comparable test suite from raw data. These

two data sets are then used to perform an analysis of errors in

dependency parsing, word-level alignment and MT, which arise

from the discontinuous order in V-Prt split constructions. In the

automatic alignments of parallel corpora, most of the particles

align to NULL. These mis-alignments and the inability of phrase-

based MT systems to recover discontinuous phrases result in low-

quality translations of V-Prt split constructions both in English and

German. However, our results show that the V-Prt split phrases

are correctly parsed in 90% of cases, suggesting that syntactic-

based MT should perform better on these constructions. We

evaluate a syntactic-based MT system on German and compare

its performance to the phrase-based system.

A Lexical Resource for the Identification of “Weak Words” in German Specification Documents

Jennifer Krisch, Melanie Dick, Ronny Jauch and Ulrich Heid

We report on the creation of a lexical resource for the identification

of potentially unspecific or imprecise constructions in German

requirements documentation from the car manufacturing industry.

In requirements engineering, such expressions are called “weak

words”: they are not sufficiently precise to ensure an unambiguous

interpretation by the contractual partners, who for the definition

of their cooperation, typically rely on specification documents

(Melchisedech, 2000); an example are dimension adjectives, such

as kurz or lang (‘short’, ‘long’) which need to be modified

by adverbials indicating the exact duration, size etc. Contrary

to standard practice in requirements engineering, where the

identification of such weak words is merely based on stopword

lists, we identify weak uses in context, by querying annotated text.

The queries are part of the resource, as they define the conditions

when a word use is weak. We evaluate the recognition of weak

uses on our development corpus and on an unseen evaluation

corpus, reaching stable F1-scores above 0.95.

Recent Advances in Development of a Lexicon-Grammar of Polish: PolNet 3.0

Zygmunt Vetulani, Grazyna Vetulani and Bartłomiej Kochanowski

The granularity of PolNet (Polish Wordnet) is the main theoretical

issue discussed in the paper. We describe the latest extension of

PolNet including valency information of simple verbs and noun-

verb collocations using manual and machine-assisted methods.

Valency is defined to include both semantic and syntactic

selectional restrictions. We assume the valency structure of a

verb to be an index of meaning. Consistently we consider it an

attribute of a synset. Strict application of this principle results in

fine granularity of the verb section of the wordnet. Considering

valency as a distinctive feature of synsets was an essential step to

transform the initial PolNet (first intended as a lexical ontology)

into a lexicon-grammar. For the present refinement of PolNet we

assume that the category of language register is a part of meaning.


The totality of PolNet 2.0 synsets is being revised in order to split

the PolNet 2.0 synsets that contain different register words into

register-uniform sub-synsets. We completed this operation for

synsets that were used as values of semantic roles. The operation

augmented the number of considered synsets by 29%. In the

paper we report an extension of the class of collocation-based verb

synsets.

C-WEP – Rich Annotated Collection of Writing Errors by Professionals

Cerstin Mahlow

This paper presents C-WEP, the Collection of Writing Errors

by Professional Writers of German. It currently consists of

245 sentences with grammatical errors. All sentences are taken

from published texts. All authors are professional writers with

high skill levels with respect to German, the genres, and the

topics. The purpose of this collection is to provide seeds for more

sophisticated writing support tools as only a very small proportion

of those errors can be detected by state-of-the-art checkers. C-

WEP is annotated on various levels and freely available.

Improving corpus search via parsing

Natalia Klyueva and Pavel Stranák

In this paper, we describe an addition to the corpus query

system Kontext that makes it possible to enhance searches using syntactic

attributes in addition to the existing features, mainly lemmas and

morphological categories. We present the enhancements of the

corpus query system itself, the attributes we use to represent

syntactic structures in data, and some examples of querying the

syntactically annotated corpora, such as treebanks in various

languages as well as an automatically parsed large corpus.

P36 - Sentiment Analysis and Opinion Mining (2), Thursday, May 26, 16:55

Chairperson: Manfred Stede Poster Session

Affective Lexicon Creation for the Greek Language

Elisavet Palogiannidi, Polychronis Koutsakis, Elias Iosif and Alexandros Potamianos

Starting from the English affective lexicon ANEW (Bradley and

Lang, 1999a) we have created the first Greek affective lexicon.

It contains human ratings for the three continuous affective

dimensions of valence, arousal and dominance for 1034 words.

The Greek affective lexicon is compared with affective lexica in

English, Spanish and Portuguese. The lexicon is automatically

expanded by selecting a small number of manually annotated

words to bootstrap the process of estimating affective ratings

of unknown words. We experimented with the parameters of

the semantic-affective model in order to investigate their impact

on its performance, which reaches 85% binary classification

accuracy (positive vs. negative ratings). We share the Greek

affective lexicon that consists of 1034 words and the automatically

expanded Greek affective lexicon that contains 407K words.

A Hungarian Sentiment Corpus Manually Annotated at Aspect Level

Martina Katalin Szabó, Veronika Vincze, Katalin Ilona Simkó, Viktor Varga and Viktor Hangya

In this paper we present a Hungarian sentiment corpus manually

annotated at aspect level. Our corpus consists of Hungarian

opinion texts written about different types of products. The main

aim of creating the corpus was to produce an appropriate database

providing possibilities for developing text mining software tools.

The corpus is a unique Hungarian database: to the best of

our knowledge, no digitized Hungarian sentiment corpus that is

annotated on the level of fragments and targets has been made so

far. In addition, many language elements of the corpus, relevant

from the point of view of sentiment analysis, received distinct types

of tags in the annotation. In this paper, on the one hand, we

present the method of annotation, and we discuss the difficulties

concerning text annotation process. On the other hand, we provide

some quantitative and qualitative data on the corpus. We conclude

with a description of the applicability of the corpus.

Effect Functors for Opinion Inference

Josef Ruppenhofer and Jasper Brandes

Sentiment analysis has so far focused on the detection of explicit

opinions. However, of late implicit opinions have received broader

attention, the key idea being that the evaluation of an event type by

a speaker depends on how the participants in the event are valued

and how the event itself affects the participants. We present an

annotation scheme for adding relevant information, couched in

terms of so-called effect functors, to German lexical items. Our

scheme synthesizes and extends previous proposals. We report

on an inter-annotator agreement study. We also present results

of a crowdsourcing experiment to test the utility of some known

and some new functors for opinion inference where, unlike in

previous work, subjects are asked to reason from event evaluation

to participant evaluation.

Sentiframes: A Resource for Verb-centered German Sentiment Inference

Manfred Klenner and Michael Amsler

In this paper, a German verb resource for verb-centered sentiment

inference is introduced and evaluated. Our model specifies verb


polarity frames that capture the polarity effects on the fillers of

the verb’s arguments given a sentence with that verb frame. Verb

signatures and selectional restrictions are also part of the model.

An algorithm to apply the verb resource to treebank sentences and

the results of our first evaluation are discussed.

Annotating Sentiment and Irony in the Online Italian Political Debate on #labuonascuola

Marco Stranisci, Cristina Bosco, Delia Irazú Hernández Farías and Viviana Patti

In this paper we present the TWitterBuonaScuola corpus (TW-

BS), a novel Italian linguistic resource for Sentiment Analysis,

developed with the main aim of analyzing the online debate

on the controversial Italian political reform “Buona Scuola”

(Good school), aimed at reorganizing the national educational

and training systems. We describe the methodologies applied

in the collection and annotation of data. The collection has

been driven by the detection of the hashtags mainly used by

the participants in the debate, while the annotation has been

focused on sentiment polarity and irony, but also extended to

mark the aspects of the reform that were mainly discussed in the

debate. An in-depth study of the disagreement among annotators

is included. We describe the collection and annotation stages, and

the in-depth analysis of disagreement made with Crowdflower, a

crowdsourcing annotation platform.

NileULex: A Phrase and Word Level Sentiment Lexicon for Egyptian and Modern Standard Arabic

Samhaa R. El-Beltagy

This paper presents NileULex, which is an Arabic sentiment

lexicon containing close to six thousand Arabic words and

compound phrases. Forty-five percent of the terms and expressions

in the lexicon are Egyptian or colloquial, while fifty-five percent

are Modern Standard Arabic. While the collection of many of

the terms included in the lexicon was done automatically, the

actual addition of any term was done manually. One of the

important criteria for adding terms to the lexicon was that they

be as unambiguous as possible. The result is a lexicon with a

much higher quality than any translated variant or automatically

constructed one. To demonstrate that a lexicon such as this

can directly impact the task of sentiment analysis, a very basic

machine learning based sentiment analyser that uses unigrams,

bigrams, and lexicon based features was applied on two different

Twitter datasets. The obtained results were compared to a baseline

system that only uses unigrams and bigrams. The same lexicon

based features were also generated using a publicly available

translation of a popular sentiment lexicon. The experiments show

that usage of the developed lexicon improves the results over both

the baseline and the publicly available lexicon.

OPFI: A Tool for Opinion Finding in Polish

Aleksander Wawer

The paper contains a description of OPFI: Opinion Finder for

the Polish Language, a freely available tool for opinion target

extraction. The goal of the tool is opinion finding: a task of

identifying tuples composed of sentiment (positive or negative)

and its target (about what or whom is the sentiment expressed).

OPFI is not dependent on any particular method of sentiment

identification and provides a built-in sentiment dictionary as a

convenient option. Technically, it contains implementations of

three different modes of opinion tuple generation: one hybrid

based on dependency parsing and CRF, the second based on

shallow parsing and the third on deep learning, namely GRU

neural network. The paper also contains a description of related

language resources: two annotated treebanks and one set of

tweets.

Rude waiter but mouthwatering pastries! An exploratory study into Dutch Aspect-Based Sentiment Analysis

Orphee De Clercq and Veronique Hoste

The fine-grained task of automatically detecting all sentiment

expressions within a given document and the aspects to which

they refer is known as aspect-based sentiment analysis. In this

paper we present the first full aspect-based sentiment analysis

pipeline for Dutch and apply it to customer reviews. To this

purpose, we collected reviews from two different domains, i.e.

restaurant and smartphone reviews. Both corpora have been

manually annotated using newly developed guidelines that comply

with standard practices in the field. For our experimental pipeline

we perceive aspect-based sentiment analysis as a task consisting

of three main subtasks which have to be tackled incrementally:

aspect term extraction, aspect category classification and polarity

classification. First experiments on our Dutch restaurant corpus

reveal that this is indeed a feasible approach that yields promising

results.

P37 - Parallel and Comparable Corpora, Thursday, May 26, 16:55

Chairperson: Jörg Tiedemann Poster Session

Building A Case-based Semantic English-Chinese Parallel Treebank

Huaxing Shi, Tiejun Zhao and Keh-Yih Su

We construct a case-based English-to-Chinese semantic

constituent parallel Treebank for a Statistical Machine Translation

(SMT) task by labelling each node of the Deep Syntactic Tree


(DST) with our refined semantic cases. Since subtree span-

crossing is harmful in tree-based SMT, DST is adopted to alleviate

this problem. At the same time, we tailor an existing case set to

represent bilingual shallow semantic relations more precisely.

This Treebank is a part of a semantic corpus building project,

which aims to build a semantic bilingual corpus annotated with

syntactic, semantic cases and word senses. Data in our Treebank

is from the news domain of Datum corpus. 4,000 sentence pairs

are selected to cover various lexicons and part-of-speech (POS)

n-gram patterns as much as possible. This paper presents the

construction of this case Treebank. Also, we have tested the

effect of adopting DST structure in alleviating subtree span-

crossing. Our preliminary analysis shows that the compatibility

between Chinese and English trees can be significantly increased

by transforming the parse-tree into the DST. Furthermore, the

human agreement rate in annotation is found to be acceptable

(90% in English nodes, 75% in Chinese nodes).

Uzbek-English and Turkish-English Morpheme Alignment Corpora

Xuansong Li, Jennifer Tracey, Stephen Grimes and Stephanie Strassel

Morphologically-rich languages pose problems for machine

translation (MT) systems, including word-alignment errors, data

sparsity and multiple affixes. Current alignment models at word-

level do not distinguish words and morphemes, thus yielding

low-quality alignment and subsequently affecting end translation

quality. Models using morpheme-level alignment can reduce the

vocabulary size of morphologically-rich languages and overcome

data sparsity. The alignment data based on the smallest units

reveals subtle language features and enhances translation quality.

Recent research proves such morpheme-level alignment (MA)

data to be valuable linguistic resources for SMT, particularly for

languages with rich morphology. In support of this research

trend, the Linguistic Data Consortium (LDC) created Uzbek-

English and Turkish-English alignment data which are manually

aligned at the morpheme level. This paper describes the

creation of the MA corpora, including the alignment and tagging process

and approaches, highlighting annotation challenges and specific

features of languages with rich morphology. The light tagging

annotation on the alignment layer adds extra value to the MA

data, facilitating users in flexibly tailoring the data for various MT

model training.

Parallel Sentence Extraction from Comparable Corpora with Neural Network Features

Chenhui Chu, Raj Dabre and Sadao Kurohashi

Parallel corpora are crucial for machine translation (MT); however,

they are quite scarce for most language pairs and domains. As

comparable corpora are far more available, many studies have

been conducted to extract parallel sentences from them for MT. In

this paper, we exploit the neural network features acquired from

neural MT for parallel sentence extraction. We observe significant

improvements for both accuracy in sentence extraction and MT

performance.

TweetMT: A Parallel Microblog Corpus

Iñaki San Vicente, Iñaki Alegria, Cristina España-Bonet, Pablo Gamallo, Hugo Gonçalo Oliveira, Eva Martinez Garcia, Antonio Toral, Arkaitz Zubiaga and Nora Aranberri

We introduce TweetMT, a parallel corpus of tweets in four

language pairs that combine five languages (Spanish from/to

Basque, Catalan, Galician and Portuguese), all of which have

an official status in the Iberian Peninsula. The corpus has been

created by combining automatic collection and crowdsourcing

approaches, and it is publicly available. It is intended for

the development and testing of microtext machine translation

systems. In this paper we describe the methodology followed to

build the corpus, and present the results of the shared task in which

it was tested.

The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine

Mariana Neves, Antonio Jimeno Yepes and Aurélie Névéol

The biomedical scientific literature is a rich source of information

not only in the English language, for which it is more abundant,

but also in other languages, such as Portuguese, Spanish

and French. We present the first freely available parallel

corpus of scientific publications for the biomedical domain.

Documents from the “Biological Sciences” and “Health Sciences”

categories were retrieved from the Scielo database and parallel

titles and abstracts are available for the following language

pairs: Portuguese/English (about 86,000 documents in total),

Spanish/English (about 95,000 documents) and French/English

(about 2,000 documents). Additionally, monolingual data was

also collected for all four languages. Sentences in the parallel

corpus were automatically aligned and a manual analysis of 200

documents by native experts found that a minimum of 79%

of sentences were correctly aligned in all language pairs. We

demonstrate the utility of the corpus by running baseline machine

translation experiments. We show that for all language pairs,

a statistical machine translation system trained on the parallel

corpora achieves performance that rivals or exceeds the state of

the art in the biomedical domain. Furthermore, the corpora are


currently being used in the biomedical task in the First Conference

on Machine Translation (WMT’16).

Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor’s Love Affair

Nikola Ljubešic, Miquel Esplà-Gomis, Antonio Toral, Sergio Ortiz Rojas and Filip Klubicka

This paper presents an approach for building large monolingual

corpora and, at the same time, extracting parallel data by crawling

the top-level domain of a given language of interest. For gathering

linguistically relevant data from top-level domains we use the

SpiderLing crawler, modified to crawl data written in multiple

languages. The output of this process is then fed to Bitextor, a tool

for harvesting parallel data from a collection of documents. We

call the system combining these two tools Spidextor, a blend of the

names of its two crucial parts. We evaluate the described approach

intrinsically by measuring the accuracy of the extracted bitexts

from the Croatian top-level domain “.hr” and the Slovene top-level

domain “.si”, and extrinsically on the English-Croatian language

pair by comparing an SMT system built from the crawled data

with third-party systems. We finally present parallel datasets

collected with our approach for the English-Croatian, English-

Finnish, English-Serbian and English-Slovene language pairs.

P38 - Social Media, Thursday, May 26, 16:55

Chairperson: Fei Xia Poster Session

Towards Using Social Media to Identify Individuals at Risk for Preventable Chronic Illness

Dane Bell, Daniel Fried, Luwen Huangfu, Mihai Surdeanu and Stephen Kobourov

We describe a strategy for the acquisition of training data

necessary to build a social-media-driven early detection system

for individuals at risk for (preventable) type 2 diabetes mellitus

(T2DM). The strategy uses a game-like quiz with data and

questions acquired semi-automatically from Twitter. The

questions are designed to inspire participant engagement and

collect relevant data to train a public-health model applied to

individuals. Prior systems designed to use social media such

as Twitter to predict obesity (a risk factor for T2DM) operate

on entire communities such as states, counties, or cities, based

on statistics gathered by government agencies. Because there

is considerable variation among individuals within these groups,

training data on the individual level would be more effective,

but this data is difficult to acquire. The approach proposed

here aims to address this issue. Our strategy has two steps.

First, we trained a random forest classifier on data gathered from

(public) Twitter statuses and state-level statistics with state-of-the-

art accuracy. We then converted this classifier into a 20-questions-

style quiz and made it available online. In doing so, we achieved

high engagement with individuals that took the quiz, while also

building a training set of voluntarily supplied individual-level data

for future classification.
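
A minimal sketch of the first step, a random forest over user-level features, is given below with entirely hypothetical features and labels; it only illustrates how feature importances could inform the choice of quiz questions, not the authors' actual model.

```python
# Minimal sketch (hypothetical features and labels, not the authors' model):
# a random forest over user-level features; feature importances can then
# suggest which questions are worth turning into quiz items.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Each row: toy per-user features derived from public tweets, e.g. relative
# frequencies of food-, exercise- and health-related terms.
X = np.array([
    [0.8, 0.1, 0.2],
    [0.2, 0.7, 0.1],
    [0.9, 0.0, 0.4],
    [0.1, 0.6, 0.0],
])
y = [1, 0, 1, 0]   # 1 = higher modelled risk, 0 = lower (toy labels)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)
```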

Can Tweets Predict TV Ratings?

Bridget Sommerdijk, Eric Sanders and Antal van den Bosch

We set out to investigate whether TV ratings and mentions of TV

programmes on the Twitter social media platform are correlated.

If such a correlation exists, Twitter may be used as an alternative

source for estimating viewer popularity. Moreover, the Twitter-

based rating estimates may be generated during the programme,

or even before. We count the occurrences of programme-specific

hashtags in an archive of Dutch tweets of eleven popular TV

shows broadcast in the Netherlands in one season, and perform

correlation tests. Overall we find a strong correlation of 0.82; the

correlation remains strong, 0.79, if tweets are counted a half hour

before broadcast time. However, the two most popular TV shows

account for most of the positive effect; if we leave out the single

and second most popular TV shows, the correlation becomes

moderate to weak. Also, within a TV show, correlations between

ratings and tweet counts are mostly weak, while correlations

between TV ratings of the previous and next shows are strong. In

absence of information on previous shows, Twitter-based counts

may be a viable alternative to classic estimation methods for

TV ratings. Estimates are more reliable with more popular TV

shows.
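
The correlation test itself is straightforward; the snippet below computes a Pearson correlation between made-up per-episode hashtag counts and ratings using scipy, purely to make the methodology concrete.

```python
# Made-up numbers purely to make the correlation test concrete.
from scipy.stats import pearsonr

hashtag_counts = [1200, 450, 3100, 800, 2600, 150, 980, 2200]   # tweets per episode
ratings = [1.8, 0.9, 3.5, 1.2, 3.0, 0.4, 1.5, 2.7]              # viewers, millions

r, p_value = pearsonr(hashtag_counts, ratings)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```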

Classifying Out-of-vocabulary Terms in a Domain-Specific Social Media Corpus

SoHyun Park, Afsaneh Fazly, Annie Lee, Brandon Seibel, Wenjie Zi and Paul Cook

In this paper we consider the problem of out-of-vocabulary term

classification in web forum text from the automotive domain. We

develop a set of nine domain- and application-specific categories

for out-of-vocabulary terms. We then propose a supervised

approach to classify out-of-vocabulary terms according to these

categories, drawing on features based on word embeddings, and

linguistic knowledge of common properties of out-of-vocabulary

terms. We show that the features based on word embeddings

are particularly informative for this task. The categories that

we predict could serve as a preliminary, automatically-generated

source of lexical knowledge about out-of-vocabulary terms.

Furthermore, we show that this approach can be adapted to give a

semi-automated method for identifying out-of-vocabulary terms


of a particular category, automotive named entities, that is of

particular interest to us.

Corpus for Customer Purchase Behavior Prediction in Social Media

Shigeyuki Sakaki, Francine Chen, Mandy Korpusik and Yan-Ying Chen

Many people post about their daily life on social media. These

posts may include information about the purchase activity of

people, and insights useful to companies can be derived from

them: e.g. profile information of a user who mentioned something

about their product. As a further advanced analysis, we consider

extracting users who are likely to buy a product from the set of

users who mentioned that the product is attractive. In this paper,

we report our methodology for building a corpus for Twitter user

purchase behavior prediction. First, we collected Twitter users

who posted a want phrase + product name: e.g. “want a Xperia” as

candidate want users, and also candidate bought users in the same

way. Then, we asked an annotator to judge whether a candidate

user actually bought a product. We also annotated whether tweets

randomly sampled from want/bought user timelines are relevant

or not to purchase. In this annotation, 58% of want user tweets

and 35% of bought user tweets were annotated as relevant. Our

data indicate that information embedded in timeline tweets can be

used to predict purchase behavior of tweeted products.

Segmenting Hashtags using Automatically Created Training Data

Arda Celebi and Arzucan Özgür

Hashtags, which are commonly composed of multiple words,

are increasingly used to convey the actual messages in tweets.

Understanding what tweets are saying is getting more dependent

on understanding hashtags. Therefore, identifying the individual

words that constitute a hashtag is an important, yet a challenging

task due to the abrupt nature of the language used in tweets.

In this study, we introduce a feature-rich approach based on

using supervised machine learning methods to segment hashtags.

Our approach is unsupervised in the sense that instead of

using manually segmented hashtags for training the machine

learning classifiers, we automatically create our training data

by using tweets as well as by automatically extracting hashtag

segmentations from a large corpus. We achieve promising results

with such automatically created noisy training data.
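
As a point of comparison for the supervised approach described above, the following sketch implements a simple unigram dynamic-programming segmenter over a toy frequency dictionary; it is a baseline illustration, not the paper's feature-rich model.

```python
# Baseline illustration only (toy frequency dictionary, not the paper's model):
# split a hashtag into the highest-probability sequence of known words.
import math

word_freq = {"no": 500, "filter": 80, "nofilter": 5, "sun": 120, "day": 300, "sunday": 200}
total = sum(word_freq.values())

def word_score(w):
    # Smoothed log-probability; unseen strings get a heavy penalty.
    return math.log(word_freq.get(w, 0.5) / total)

def segment(hashtag):
    """Return the highest-scoring split of `hashtag` into dictionary words."""
    n = len(hashtag)
    best = [(0.0, [])] + [(-math.inf, []) for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(max(0, i - 15), i):          # cap candidate word length
            score = best[j][0] + word_score(hashtag[j:i])
            if score > best[i][0]:
                best[i] = (score, best[j][1] + [hashtag[j:i]])
    return best[n][1]

print(segment("nofilter"))   # -> ['no', 'filter']
print(segment("sunday"))     # -> ['sunday']
```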

Exploring Language Variation Across Europe - A Web-based Tool for Computational Sociolinguistics

Dirk Hovy and Anders Johannsen

Language varies not only between countries, but also along

regional and socio-demographic lines. This variation is one of the

driving factors behind language change. However, investigating

language variation is a complex undertaking: the more factors we

want to consider, the more data we need. Traditional qualitative

methods are not well-suited to do this, and are therefore restricted

to isolated factors. This reduction limits the potential insights,

and risks attributing undue importance to easily observed factors.

While there is a large interest in linguistics to increase the

quantitative aspect of such studies, it requires training in both

variational linguistics and computational methods, a combination

that is still not common. We take a first step here to alleviating the

problem by providing an interface, www.languagevariation.com,

to explore large-scale language variation along multiple socio-

demographic factors – without programming knowledge. It makes

use of large amounts of data and provides statistical analyses,

maps, and interactive features that will enable scholars to explore

language variation in a data-driven way.

Predicting Author Age from Weibo Microblog Posts

Wanru Zhang, Andrew Caines, Dimitrios Alikaniotis and Paula Buttery

We report an author profiling study based on Chinese social media

texts gleaned from Sina Weibo in which we attempt to predict

the author’s age group based on various linguistic text features

mainly relating to non-standard orthography: classical Chinese

characters, hashtags, emoticons and kaomoji, homogeneous

punctuation and Latin character sequences, and poetic format.

We also tracked the use of selected popular Chinese expressions,

parts-of-speech and word types. We extracted 100 posts from

100 users in each of four age groups (under-18, 19-29, 30-39,

over-40 years) and by clustering users’ posts fifty at a time we

trained a maximum entropy classifier to predict author age group

to an accuracy of 65.5%. We show which features are associated

with younger and older age groups, and make our normalisation

resources available to other researchers.
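
A hedged sketch of the classification step: a maximum-entropy model is, in practice, multinomial logistic regression over feature counts, which can be shown with scikit-learn. The feature set, counts and age-group labels below are invented and do not come from the Weibo data.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented per-cluster feature counts: [emoticons, hashtags, classical characters, kaomoji]
X = np.array([
    [12, 5, 0, 7],   # imagined "under-18" clusters
    [9, 6, 1, 5],
    [4, 3, 2, 1],    # imagined "19-29" clusters
    [5, 2, 3, 2],
    [1, 1, 6, 0],    # imagined "30-39" clusters
    [2, 0, 5, 0],
])
y = ["under-18", "under-18", "19-29", "19-29", "30-39", "30-39"]

# Multinomial logistic regression is the usual implementation of a "maximum entropy" classifier.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print(clf.predict([[10, 4, 0, 6]]))  # likely "under-18" under this toy model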

Effects of Sampling on Twitter Trend Detection

Andrew Yates, Alek Kolcz, Nazli Goharian and Ophir Frieder

Much research has focused on detecting trends on Twitter,

including health-related trends such as mentions of Influenza-like

illnesses or their symptoms. The majority of this research has been

conducted using Twitter’s public feed, which includes only about

1% of all public tweets. It is unclear if, when, and how using

Twitter’s 1% feed has affected the evaluation of trend detection

methods. In this work we use a larger feed to investigate the

effects of sampling on Twitter trend detection. We focus on using

health-related trends to estimate the prevalence of Influenza-like


illnesses based on tweets. We use ground truth obtained from

the CDC and Google Flu Trends to explore how the prevalence

estimates degrade when moving from a 100% to a 1% sample.

We find that using the 1% sample is unlikely to substantially

harm ILI estimates made at the national level, but can cause poor

performance when estimates are made at the city level.
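
The core degradation experiment, going from a full feed to a 1% random sample and checking how well a daily trend estimate survives, can be sketched as follows. The counts are synthetic, and the Pearson correlation check is only a stand-in for the paper's comparison against CDC and Google Flu Trends ground truth.

import random
import numpy as np

random.seed(0)

# Synthetic daily counts of flu-related tweets in a hypothetical 100% feed over 30 days.
full_counts = [random.randint(800, 1200) for _ in range(30)]

# Simulate a 1% feed by keeping each tweet independently with probability 0.01.
sampled_counts = [sum(random.random() < 0.01 for _ in range(count)) for count in full_counts]

# How well does the 1% sample track the full signal?
r = np.corrcoef(full_counts, sampled_counts)[0, 1]
print("correlation between 100%% and 1%% daily estimates: %.3f" % r)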

Automatic Classification of Tweets for Analyzing Communication Behavior of Museums

Nicolas Foucault and Antoine Courtin

In this paper, we present a study on tweet classification which

aims to define the communication behavior of the 103 French

museums that participated in 2014 in the Twitter operation:

MuseumWeek. The tweets were automatically classified in

four communication categories: sharing experience, promoting

participation, interacting with the community, and promoting-

informing about the institution. Our classification is multi-class.

It combines Support Vector Machines and Naive Bayes methods

and is supported by a selection of eighteen subtypes of features

of four different kinds: metadata information, punctuation marks,

tweet-specific and lexical features. It was tested against a corpus

of 1,095 tweets manually annotated by two experts in Natural

Language Processing and Information Communication and twelve

Community Managers of French museums. We obtained a

state-of-the-art F1-score of 72% by 10-fold cross-validation.

This result is very encouraging since it is even better than some

state-of-the-art results found in the tweet classification literature.
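
A minimal sketch of the kind of setup described above, an SVM and a Naive Bayes classifier combined over bag-of-words features and scored with cross-validated F1, using scikit-learn. The toy tweets, labels and the hard-voting combination are assumptions for illustration, not the authors' exact system or feature set.

from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented museum tweets and communication categories, repeated to allow cross-validation.
texts = ["come and see our new exhibition", "share your visit photos with us",
         "thanks for your lovely comment", "our museum opens at 9am tomorrow"] * 5
labels = ["promoting participation", "sharing experience",
          "interacting with the community", "promoting-informing"] * 5

clf = make_pipeline(
    CountVectorizer(),
    VotingClassifier(
        estimators=[("svm", LinearSVC()), ("nb", MultinomialNB())],
        voting="hard",  # majority vote over the two classifiers' predictions
    ),
)

# Macro-averaged F1 over (here) 5-fold cross-validation on the toy data.
scores = cross_val_score(clf, texts, labels, cv=5, scoring="f1_macro")
print("mean F1: %.2f" % scores.mean())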

P39 - Word Sense Disambiguation (2)
Thursday, May 26, 16:55

Chairperson: Elisabetta Jezek Poster Session

Graph-Based Induction of Word Senses in Croatian

Marko Bekavac and Jan Šnajder

Word sense induction (WSI) seeks to induce senses of words

from unannotated corpora. In this paper, we address the WSI

task for the Croatian language. We adopt the word clustering

approach based on co-occurrence graphs, in which senses are

taken to correspond to strongly inter-connected components of

co-occurring words. We experiment with a number of graph

construction techniques and clustering algorithms, and evaluate

the sense inventories both as a clustering problem and extrinsically

on a word sense disambiguation (WSD) task. In the cluster-

based evaluation, the Chinese Whispers algorithm outperformed

Markov Clustering, yielding a normalized mutual information

score of 64.3. In contrast, in WSD evaluation Markov Clustering

performed better, yielding an accuracy of about 75%. We are

making available two induced sense inventories of the 10,000 most

frequent Croatian words: one coarse-grained and one fine-grained

inventory, both obtained using the Markov Clustering algorithm.
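
For readers unfamiliar with it, Chinese Whispers is a simple label-propagation clustering over a graph: every node starts in its own cluster and repeatedly adopts the label carrying the most edge weight among its neighbours. The toy co-occurrence graph below is invented and has nothing to do with the Croatian data; it only illustrates the algorithm.

import random
from collections import defaultdict

random.seed(1)

# Toy weighted co-occurrence graph over context words of an ambiguous target word.
edges = [("river", "water", 4.0), ("river", "shore", 2.0), ("water", "shore", 1.0),
         ("money", "loan", 3.0), ("loan", "interest", 2.0), ("money", "interest", 1.5)]

graph = defaultdict(dict)
for a, b, w in edges:
    graph[a][b] = w
    graph[b][a] = w

# Chinese Whispers: start with one label per node, then let each node adopt the
# label with the highest total edge weight among its neighbours; iterate a few times.
labels = {node: node for node in graph}
for _ in range(10):
    nodes = list(graph)
    random.shuffle(nodes)
    for node in nodes:
        scores = defaultdict(float)
        for neighbour, weight in graph[node].items():
            scores[labels[neighbour]] += weight
        labels[node] = max(scores, key=scores.get)

clusters = defaultdict(list)
for node, label in labels.items():
    clusters[label].append(node)
print(list(clusters.values()))  # groups of co-occurring words, i.e. induced senses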

A Multi-domain Corpus of Swedish Word Sense Annotation
Richard Johansson, Yvonne Adesam, Gerlof Bouma and Karin Hedberg

We describe the word sense annotation layer in Eukalyptus, a

freely available five-domain corpus of contemporary Swedish

with several annotation layers. The annotation uses the SALDO

lexicon to define the sense inventory, and allows word sense

annotation of compound segments and multiword units. We give

an overview of the new annotation tool developed for this project,

and finally present an analysis of the inter-annotator agreement

between two annotators.

QTLeap WSD/NED Corpora: Semantic Annotation of Parallel Corpora in Six Languages
Arantxa Otegi, Nora Aranberri, António Branco, Jan Hajic, Martin Popel, Kiril Simov, Eneko Agirre, Petya Osenova, Rita Pereira, João Silva and Steven Neale

This work presents parallel corpora automatically annotated with

several NLP tools, including lemma and part-of-speech tagging,

named-entity recognition and classification, named-entity

disambiguation, word-sense disambiguation, and coreference.

The corpora comprise both the well-known Europarl corpus and a

domain-specific question-answer troubleshooting corpus in the

IT domain. English is common in all parallel corpora, with

translations in five languages, namely, Basque, Bulgarian, Czech,

Portuguese and Spanish. We describe the annotated corpora and

the tools used for annotation, as well as annotation statistics for

each language. These new resources are freely available and will

help research on semantic processing for machine translation and

cross-lingual transfer.

Combining Semantic Annotation of Word Sense & Semantic Roles: A Novel Annotation Scheme for VerbNet Roles on German Language Data
Éva Mújdricza-Maydt, Silvana Hartmann, Iryna Gurevych and Anette Frank

We present a VerbNet-based annotation scheme for semantic roles

that we explore in an annotation study on German language

data that combines word sense and semantic role annotation.

We reannotate a substantial portion of the SALSA corpus with

GermaNet senses and a revised scheme of VerbNet roles. We

provide a detailed evaluation of the interaction between sense and

role annotation. The resulting corpus will allow us to compare

VerbNet role annotation for German to FrameNet and PropBank

annotation by mapping to existing role annotations on the SALSA


corpus. We publish the annotated corpus and detailed guidelines

for the new role annotation scheme.

Synset Ranking of Hindi WordNet

Sudha Bhingardive, Rajita Shukla, Jaya Saraswati, Laxmi Kashyap, Dhirendra Singh and Pushpak Bhattacharya

Word Sense Disambiguation (WSD) is one of the open problems

in the area of natural language processing. Various supervised,

unsupervised and knowledge based approaches have been

proposed for automatically determining the sense of a word in a

particular context. It has been observed that such approaches often

find it difficult to beat the WordNet First Sense (WFS) baseline

which assigns the sense irrespective of context. In this paper, we

present our work on creating the WFS baseline for Hindi language

by manually ranking the synsets of Hindi WordNet. A ranking

tool was developed in which human experts can see the frequency of

each word sense in the sense-tagged corpora; the experts were asked

to rank the senses of a word using this information together with

their own intuition. The accuracy of the WFS baseline is tested on

several standard datasets. The F-score is found to be 60%, 65% and

55% on the Health, Tourism and News datasets, respectively. The

created rankings can also be used in other NLP applications viz.,

Machine Translation, Information Retrieval, Text Summarization,

etc.
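
How such a first-sense baseline is scored can be made concrete in a few lines: every test occurrence of a word receives its top-ranked sense, and the result is compared with a sense-tagged gold standard. The ranked lexicon and gold instances below are invented English placeholders, not Hindi WordNet content, and this is not the authors' evaluation code.

# Invented first-sense lexicon: the top-ranked sense of each word.
first_sense = {
    "bank": "bank#finance",
    "plant": "plant#flora",
}

# Invented sense-tagged test instances: (word, gold sense).
gold = [
    ("bank", "bank#finance"),
    ("bank", "bank#river"),
    ("plant", "plant#flora"),
    ("plant", "plant#factory"),
    ("plant", "plant#flora"),
]

correct = sum(1 for word, sense in gold if first_sense.get(word) == sense)
print("first-sense baseline accuracy: %.2f" % (correct / len(gold)))  # 0.60 on this toy set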

Neural Embedding Language Models in Semantic Clustering of Web Search Results

Andrey Kutuzov and Elizaveta Kuzmenko

In this paper, a new approach towards semantic clustering of the

results of ambiguous search queries is presented. We propose

using distributed vector representations of words trained with

the help of prediction-based neural embedding models to detect

senses of search queries and to cluster search engine results page

according to these senses. The words from titles and snippets

together with semantic relationships between them form a graph,

which is further partitioned into components related to different

query senses. This approach to search engine results clustering is

evaluated against a new manually annotated evaluation data set of

Russian search queries. We show that in the task of semantically

clustering search results, prediction-based models slightly but

stably outperform traditional count-based ones, with the same

training corpora.
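
A highly simplified sketch of the graph step: words from titles and snippets become nodes, edges are added when embedding cosine similarity exceeds a threshold, and connected components stand in for query senses. The tiny hand-written vectors replace a real prediction-based embedding model, so everything below is illustrative only.

import numpy as np
from collections import defaultdict

# Toy 3-dimensional "embeddings" for words taken from imaginary search-result snippets.
vectors = {
    "guitar":  np.array([0.9, 0.1, 0.0]),
    "chord":   np.array([0.8, 0.2, 0.1]),
    "bird":    np.array([0.0, 0.9, 0.1]),
    "feather": np.array([0.1, 0.8, 0.2]),
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Build a similarity graph and split it into connected components (candidate query senses).
threshold = 0.8
graph = defaultdict(set)
words = list(vectors)
for i, a in enumerate(words):
    for b in words[i + 1:]:
        if cosine(vectors[a], vectors[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)

seen, senses = set(), []
for word in words:
    if word in seen:
        continue
    stack, component = [word], set()
    while stack:
        node = stack.pop()
        if node not in component:
            component.add(node)
            stack.extend(graph[node] - component)
    seen |= component
    senses.append(sorted(component))
print(senses)  # e.g. [['chord', 'guitar'], ['bird', 'feather']]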

O29 - Panel on International Initiatives from Public Agencies
Thursday, May 26, 16:55

Chairperson: Khalid Choukri Oral Session

O30 - Multimodality, Multimedia and Evaluation
Thursday, May 26, 16:55

Chairperson: Nick Campbell Oral Session

Impact of Automatic Segmentation on the Quality, Productivity and Self-reported Post-editing Effort of Intralingual Subtitles
Aitor Alvarez, Marina Balenciaga, Arantza del Pozo, Haritz Arzelus, Anna Matamala and Carlos-D. Martínez-Hinarejos

This paper describes the evaluation methodology followed to

measure the impact of using a machine learning algorithm to

automatically segment intralingual subtitles. The segmentation

quality, productivity and self-reported post-editing effort achieved

with such approach are shown to improve those obtained by the

technique based on counting characters that is currently mainly

employed for automatic subtitle segmentation. The corpus used to

train and test the proposed automated segmentation method is also

described and shared with the community, in order to foster further

research in this area.

1 Million Captioned Dutch Newspaper Images
Desmond Elliott and Martijn Kleppe

Images naturally appear alongside text in a wide variety of media,

such as books, magazines, newspapers, and in online articles. This

type of multi-modal data offers an interesting basis for vision and

language research but most existing datasets use crowdsourced

text, which removes the images from their original context. In this

paper, we introduce the KBK-1M dataset of 1.6 million images

in their original context, with co-occurring texts found in Dutch

newspapers from 1922-1994. The images are digitally scanned

photographs, cartoons, sketches, and weather forecasts; the text

is generated from OCR scanned blocks. The dataset is suitable

for experiments in automatic image captioning, image–article

matching, object recognition, and data-to-text generation for

weather forecasting. It can also be used by humanities scholars to

analyse photographic style changes, the representation of people

and societal issues, and new tools for exploring photograph reuse

via image-similarity-based search.

Cross-validating Image Description Datasets and Evaluation Metrics
Josiah Wang and Robert Gaizauskas

The task of automatically generating sentential descriptions of

image content has become increasingly popular in recent years,

resulting in the development of large-scale image description

datasets and the proposal of various metrics for evaluating image

description generation systems. However, not much work has


been done to analyse and understand both datasets and the

metrics. In this paper, we propose using a leave-one-out cross

validation (LOOCV) process as a means to analyse multiply

annotated, human-authored image description datasets and the

various evaluation metrics, i.e. evaluating one image description

against other human-authored descriptions of the same image.

Such an evaluation process affords various insights into the image

description datasets and evaluation metrics, such as the variations

of image descriptions within and across datasets and also what

the metrics capture. We compute and analyse (i) human upper-

bound performance; (ii) ranked correlation between metric pairs

across datasets; (iii) lower-bound performance by comparing a

set of descriptions describing one image to another sentence not

describing that image. Interesting observations are made about

the evaluation metrics and image description datasets, and we

conclude that such cross-validation methods are extremely useful

for assessing and gaining insights into image description datasets

and evaluation metrics for image descriptions.
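
The leave-one-out idea, scoring each human description against the remaining descriptions of the same image, can be illustrated with a generic sentence-level metric. BLEU via NLTK is used here purely as a stand-in for the metrics examined in the paper, and the captions are invented.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Five invented human descriptions of the same image.
captions = [
    "a dog runs across the grass",
    "a brown dog is running on a lawn",
    "the dog sprints over green grass",
    "a dog playing outside on the grass",
    "a small dog runs through a garden",
]

smooth = SmoothingFunction().method1
scores = []
for i, held_out in enumerate(captions):
    # Treat the held-out caption as the "system output" and the rest as references.
    references = [c.split() for j, c in enumerate(captions) if j != i]
    scores.append(sentence_bleu(references, held_out.split(),
                                weights=(0.5, 0.5), smoothing_function=smooth))

# The average leave-one-out score approximates a human upper bound under this metric.
print("mean leave-one-out BLEU-2: %.3f" % (sum(scores) / len(scores)))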

Detection of Major ASL Sign Types in Continuous Signing For ASL Recognition

Polina Yanovich, Carol Neidle and Dimitris Metaxas

In American Sign Language (ASL) as well as other signed

languages, different classes of signs (e.g., lexical signs,

fingerspelled signs, and classifier constructions) have different

internal structural properties. Continuous sign recognition

accuracy can be improved through use of distinct recognition

strategies, as well as different training datasets, for each class

of signs. For these strategies to be applied, continuous

signing video needs to be segmented into parts corresponding to

particular classes of signs. In this paper we present a multiple

instance learning-based segmentation system that accurately

labels 91.27% of the video frames of 500 continuous utterances

(including 7 different subjects) from the publicly accessible

NCSLGR corpus http://secrets.rutgers.edu/dai/queryPages

(Neidle and Vogler, 2012). The system uses novel

feature descriptors derived from both motion and shape statistics

of the regions of high local motion. The system does not require a

hand tracker.

O31 - Summarisation and Simplification
Thursday, May 26, 16:55

Chairperson: Udo Kruschwitz Oral Session

Benchmarking Lexical Simplification Systems

Gustavo Paetzold and Lucia Specia

Lexical Simplification is the task of replacing complex words

in a text with simpler alternatives. A variety of strategies

have been devised for this challenge, yet there has been little

effort in comparing their performance. In this contribution,

we present a benchmarking of several Lexical Simplification

systems. By combining resources created in previous work

with automatic spelling and inflection correction techniques,

we introduce BenchLS: a new evaluation dataset for the task.

Using BenchLS, we evaluate the performance of solutions for

various steps in the typical Lexical Simplification pipeline,

both individually and jointly. This is the first time Lexical

Simplification systems are compared in such fashion on the same

data, and the findings introduce many contributions to the field,

revealing several interesting properties of the systems evaluated.

A Multi-Layered Annotated Corpus of Scientific Papers

Beatriz Fisas, Francesco Ronzano and Horacio Saggion

Scientific literature records the research process with a

standardized structure and provides the clues to track the

progress in a scientific field. Understanding its internal structure

and content is of paramount importance for natural language

processing (NLP) technologies. To meet this requirement, we

have developed a multi-layered annotated corpus of scientific

papers in the domain of Computer Graphics. Sentences are

annotated with respect to their role in the argumentative structure

of the discourse. The purpose of each citation is specified.

Special features of the scientific discourse such as advantages and

disadvantages are identified. In addition, a grade is allocated to

each sentence according to its relevance for being included in

a summary. To the best of our knowledge, this complex, multi-

layered collection of annotations and metadata characterizing a

set of research papers had never been grouped together before in

one corpus and therefore constitutes a newer, richer resource with

respect to those currently available in the field.

Extractive Summarization under Strict Length Constraints

Yashar Mehdad, Amanda Stent, Kapil Thadani, Dragomir Radev, Youssef Billawala and Karolina Buchner

In this paper we report a comparison of various techniques

for single-document extractive summarization under strict length

budgets, which is a common commercial use case (e.g.

summarization of news articles by news aggregators). We show

that, evaluated using ROUGE, numerous algorithms from the

literature fail to beat a simple lead-based baseline for this task.

However, a supervised approach with lightweight and efficient

features improves over the lead-based baseline. Additional

human evaluation demonstrates that the supervised approach also


performs competitively with a commercial system that uses more

sophisticated features.
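
The lead-based baseline referred to above is easy to make concrete: keep sentences from the top of the article for as long as a strict character budget allows. The sentence-split input and the budget below are simplifying assumptions.

def lead_summary(sentences, max_chars):
    """Greedy lead baseline: keep initial sentences while the character budget allows."""
    summary, used = [], 0
    for sentence in sentences:
        if used + len(sentence) > max_chars:
            break
        summary.append(sentence)
        used += len(sentence)
    return " ".join(summary)

article = [
    "The city council approved the new transit plan on Monday.",
    "The plan adds three bus lines and extends service hours.",
    "Funding comes from a mix of state grants and local taxes.",
    "Critics argue the changes do not go far enough.",
]
print(lead_summary(article, max_chars=130))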

What’s the Issue Here?: Task-based Evaluation of Reader Comment Summarization Systems

Emma Barker, Monica Paramita, Adam Funk, Emina Kurtic, Ahmet Aker, Jonathan Foster, Mark Hepple and Robert Gaizauskas

Automatic summarization of reader comments in on-line news

is an extremely challenging task and a capability for which

there is a clear need. Work to date has focussed on producing

extractive summaries using well-known techniques imported from

other areas of language processing. But are extractive summaries

of comments what users really want? Do they support users

in performing the sorts of tasks they are likely to want to

perform with reader comments? In this paper we address these

questions by doing three things. First, we offer a specification

of one possible summary type for reader comment, based on an

analysis of reader comment in terms of issues and viewpoints.

Second, we define a task-based evaluation framework for reader

comment summarization that allows summarization systems to

be assessed in terms of how well they support users in a time-

limited task of identifying issues and characterising opinion on

issues in comments. Third, we describe a pilot evaluation in

which we used the task-based evaluation framework to evaluate a

prototype reader comment clustering and summarization system,

demonstrating the viability of the evaluation framework and

illustrating the sorts of insight such an evaluation affords.

O32 - Morphology (2)
Thursday, May 26, 16:55

Chairperson: Marko Tadic Oral Session

A Novel Evaluation Method for Morphological Segmentation

Javad Nouri and Roman Yangarber

Unsupervised learning of morphological segmentation of words

in a language, based only on a large corpus of words, is a

challenging task. Evaluation of the learned segmentations is

a challenge in itself, due to the inherent ambiguity of the

segmentation task. There is no way to posit unique “correct”

segmentation for a set of data in an objective way. Two models

may arrive at different ways of segmenting the data, which may

nonetheless both be valid. Several evaluation methods have been

proposed to date, but they do not insist on consistency of the

evaluated model. We introduce a new evaluation methodology,

which enforces correctness of segmentation boundaries while also

assuring consistency of segmentation decisions across the corpus.

Bilingual Lexicon Extraction at the Morpheme Level Using Distributional Analysis

Amir Hazem and Béatrice Daille

Bilingual lexicon extraction from comparable corpora is usually

based on distributional methods when dealing with single word

terms (SWT). These methods often treat SWT as single tokens

without considering their compositional property. However, many

SWT are compositional (composed of roots and affixes) and

this information, if taken into account can be very useful to

match translational pairs, especially for infrequent terms where

distributional methods often fail. For instance, the English

compound xenograft, which is composed of the root xeno and

the lexeme graft can be translated into French compositionally

by aligning each of its elements (xeno with xéno and graft with

greffe) resulting in the translation: xénogreffe. In this paper,

we experiment with several distributional models at the morpheme

level, which we apply to perform compositional translation on a

subset of French and English compounds. We show promising

results using distributional analysis at the root and affix levels.

We also show that the adapted approach significantly improves

bilingual lexicon extraction from comparable corpora compared

to the approach at the word level.
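
The xenograft to xénogreffe example above amounts to splitting a compound into known morphemes and translating each part with a morpheme-level lexicon. The sketch below uses a tiny invented English-to-French morpheme dictionary and a greedy longest-match split; a real system would rank candidate splits and translations with distributional evidence.

# Tiny invented English-to-French morpheme lexicon.
morpheme_lexicon = {
    "xeno": "xéno",
    "graft": "greffe",
    "hydro": "hydro",
    "therapy": "thérapie",
}

def compositional_translate(compound):
    """Greedily split the compound into known morphemes and translate each part."""
    parts, i = [], 0
    while i < len(compound):
        for j in range(len(compound), i, -1):  # longest match first
            if compound[i:j] in morpheme_lexicon:
                parts.append(compound[i:j])
                i = j
                break
        else:
            return None  # some substring is not covered by the lexicon
    return "".join(morpheme_lexicon[part] for part in parts)

print(compositional_translate("xenograft"))     # xénogreffe
print(compositional_translate("hydrotherapy"))  # hydrothérapie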

Remote Elicitation of Inflectional Paradigms to Seed Morphological Analysis in Low-Resource Languages

John Sylak-Glassman, Christo Kirov and David Yarowsky

Structured, complete inflectional paradigm data exists for very few

of the world’s languages, but is crucial to training morphological

analysis tools. We present methods inspired by linguistic

fieldwork for gathering inflectional paradigm data in a machine-

readable, interoperable format from remotely-located speakers of

any language. Informants are tasked with completing language-

specific paradigm elicitation templates. Templates are constructed

by linguists using grammatical reference materials to ensure

completeness. Each cell in a template is associated with

contextual prompts designed to help informants with varying

levels of linguistic expertise (from professional translators to

untrained native speakers) provide the desired inflected form. To

facilitate downstream use in interoperable NLP/HLT applications,

each cell is also associated with a language-independent machine-

readable set of morphological tags from the UniMorph Schema.

This data is useful for seeding morphological analysis and

generation software, particularly when the data is representative

of the range of surface morphological variation in the language.


At present, we have obtained 792 lemmas and 25,056 inflected

forms from 15 languages.

Very-large Scale Parsing and Normalization of Wiktionary Morphological Paradigms

Christo Kirov, John Sylak-Glassman, Roger Que and David Yarowsky

Wiktionary is a large-scale resource for cross-lingual lexical

information with great potential utility for machine translation

(MT) and many other NLP tasks, especially automatic

morphological analysis and generation. However, it is designed

primarily for human viewing rather than machine readability,

and presents numerous challenges for generalized parsing

and extraction due to a lack of standardized formatting and

grammatical descriptor definitions. This paper describes a

large-scale effort to automatically extract and standardize the

data in Wiktionary and make it available for use by the NLP

research community. The methodological innovations include a

multidimensional table parsing algorithm, a cross-lexeme, token-

frequency-based method of separating inflectional form data

from grammatical descriptors, the normalization of grammatical

descriptors to a unified annotation scheme that accounts for cross-

linguistic diversity, and a verification and correction process that

exploits within-language, cross-lexeme table format consistency

to minimize human effort. The effort described here resulted

in the extraction of a uniquely large normalized resource of

nearly 1,000,000 inflectional paradigms across 350 languages.

Evaluation shows that even though the data is extracted using

a language-independent approach, it is comparable in quantity

and quality to data extracted using hand-tuned, language-specific

approaches.

P40 - Dialogue (1)
Thursday, May 26, 18:20

Chairperson: Jens Edlund Poster Session

AppDialogue: Multi-App Dialogues for Intelligent Assistants

Ming Sun, Yun-Nung Chen, Zhenhao Hua, Yulian Tamres-Rudnicky, Arnab Dash and Alexander Rudnicky

Users will interact with an individual app on smart devices

(e.g., phone, TV, car) to fulfill a specific goal (e.g. find

a photographer), but users may also pursue more complex

tasks that will span multiple domains and apps (e.g. plan a

wedding ceremony). Planning and executing such multi-app

tasks are typically managed by users, considering the required

global context awareness. To investigate how users arrange

domains/apps to fulfill complex tasks in their daily life, we

conducted a user study on 14 participants to collect such data

from their Android smart phones. This document 1) summarizes

the techniques used in the data collection and 2) provides a brief

statistical description of the data. These data guide future

directions for researchers in the fields of conversational agents,

personal assistants, and related areas. The data are available at

http://AppDialogue.com.

Modelling Multi-issue Bargaining Dialogues: Data Collection, Annotation Design and Corpus

Volha Petukhova, Christopher Stevens, Harmen de Weerd, Niels Taatgen, Fokie Cnossen and Andrei Malchanau

The paper describes experimental dialogue data collection

activities, as well as semantically annotated corpus

creation, undertaken within the EU-funded METALOGUE

project (www.metalogue.eu). The project aims to develop a

dialogue system with flexible dialogue management to enable

the system’s adaptive, reactive, interactive and proactive dialogue

behavior in setting goals, choosing appropriate strategies and

monitoring numerous parallel interpretation and management

processes. To achieve these goals negotiation (or more precisely

multi-issue bargaining) scenario has been considered as the

specific setting and application domain. The dialogue corpus

forms the basis for the design of task and interaction models of

participants negotiation behavior, and subsequently for dialogue

system development, which would be capable of replacing one of

the negotiators. The METALOGUE corpus will be released to the

community for research purposes.

The Negochat Corpus of Human-agent Negotiation Dialogues

Vasily Konovalov, Ron Artstein, Oren Melamud and Ido Dagan

Annotated in-domain corpora are crucial to the successful

development of dialogue systems of automated agents, and in

particular for developing natural language understanding (NLU)

components of such systems. Unfortunately, such important

resources are scarce. In this work, we introduce an annotated

natural language human-agent dialogue corpus in the negotiation

domain. The corpus was collected using Amazon Mechanical

Turk following the ‘Wizard-Of-Oz’ approach, where a ‘wizard’

human translates the participants’ natural language utterances in

real time into a semantic language. Once dialogue collection

was completed, utterances were annotated with intent labels

by two independent annotators, achieving high inter-annotator

agreement. Our initial experiments with an SVM classifier show

that automatically inferring such labels from the utterances is far


from trivial. We make our corpus publicly available to serve as an

aid in the development of dialogue systems for negotiation agents,

and suggest that analogous corpora can be created following our

methodology and using our available source code. To the best

of our knowledge this is the first publicly available negotiation

dialogue corpus.

The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics

Ryuichiro Higashinaka, Kotaro Funakoshi, Yuka Kobayashi and Michimasa Inaba

Dialogue breakdown detection is a promising technique in

dialogue systems. To promote the research and development of

such a technique, we organized a dialogue breakdown detection

challenge where the task is to detect a system’s inappropriate

utterances that lead to dialogue breakdowns in chat. This paper

describes the design, datasets, and evaluation metrics for the

challenge as well as the methods and results of the submitted runs

of the participants.

The DialogBank

Harry Bunt, Volha Petukhova, Andrei Malchanau, Kars Wijnhoven and Alex Fang

This paper presents the DialogBank, a new language resource

consisting of dialogues with gold standard annotations according

to the ISO 24617-2 standard. Some of these dialogues have

been taken from existing corpora and have been re-annotated

according to the ISO standard; others have been annotated directly

according to the standard. The ISO 24617-2 annotations have

been designed according to the ISO principles for semantic

annotation, as formulated in ISO 24617-6. The DialogBank makes

use of three alternative representation formats, which are shown to

be interoperable.

Coordinating Communication in the Wild: The Artwalk Dialogue Corpus of Pedestrian Navigation and Mobile Referential Communication

Kris Liu, Jean Fox Tree and Marilyn Walker

The Artwalk Corpus is a collection of 48 mobile phone

conversations between 24 pairs of friends and 24 pairs of

strangers performing a novel, naturalistically-situated referential

communication task. This task produced dialogues which, on

average, are just under 40 minutes. The task requires the

identification of public art while walking around and navigating

pedestrian routes in the downtown area of Santa Cruz, California.

The task involves a Director on the UCSC campus with access

to maps providing verbal instructions to a Follower executing

the task. The task provides a setting for real-world situated

dialogic language and is designed to: (1) elicit entrainment

and coordination of referring expressions between the dialogue

participants, (2) examine the effect of friendship on dialogue

strategies, and (3) examine how the need to complete the task

while negotiating myriad, unanticipated events in the real world

– such as avoiding cars and other pedestrians – affects linguistic

coordination and other dialogue behaviors. Previous work

on entrainment and coordinating communication has primarily

focused on similar tasks in laboratory settings where there are no

interruptions and no need to navigate from one point to another

in a complex space. The corpus provides a general resource for

studies on how coordinated task-oriented dialogue changes when

we move outside the laboratory and into the world. It can also

be used for studies of entrainment in dialogue, and the form and

style of pedestrian instruction dialogues, as well as the effect of

friendship on dialogic behaviors.

Managing Linguistic and Terminological Variation in a Medical Dialogue System
Leonardo Campillos Llanos, Dhouha Bouamor, Pierre Zweigenbaum and Sophie Rosset

We introduce a dialogue task between a virtual patient and a doctor

where the dialogue system, playing the patient part in a simulated

consultation, must reconcile a specialized level, to understand

what the doctor says, and a lay level, to output realistic patient-

language utterances. This increases the challenges in the analysis

and generation phases of the dialogue. This paper proposes

methods to manage linguistic and terminological variation in

that situation and illustrates how they help produce realistic

dialogues. Our system makes use of lexical resources for

processing synonyms, inflectional and derivational variants, or

pronoun/verb agreement. In addition, specialized knowledge is

used for processing medical roots and affixes, ontological relations

and concept mapping, and for generating lay variants of terms

according to the patient’s non-expert discourse. We also report the

results of a first evaluation carried out by 11 users interacting with

the system. We evaluated the non-contextual analysis module,

which supports the Spoken Language Understanding step. The

annotation of task domain entities obtained 91.8% of Precision,

82.5% of Recall, 86.9% of F-measure, 19.0% of Slot Error Rate,

and 32.9% of Sentence Error Rate.

A Corpus of Word-Aligned Asked and Anticipated Questions in a Virtual Patient Dialogue System
Ajda Gokcen, Evan Jaffe, Johnsey Erdmann, Michael White and Douglas Danforth

We present a corpus of virtual patient dialogues to which we

have added manually annotated gold standard word alignments.

Since each question asked by a medical student in the dialogues


is mapped to a canonical, anticipated version of the question,

the corpus implicitly defines a large set of paraphrase (and non-

paraphrase) pairs. We also present a novel process for selecting

the most useful data to annotate with word alignments and for

ensuring consistent paraphrase status decisions. In support of

this process, we have enhanced the earlier Edinburgh alignment

tool (Cohn et al., 2008) and revised and extended the Edinburgh

guidelines, in particular adding guidance intended to ensure that

the word alignments are consistent with the overall paraphrase

status decision. The finished corpus and the enhanced alignment

tool are made freely available.

A CUP of CoFee: A large Collection of feedback Utterances Provided with communicative function annotations

Laurent Prévot, Jan Gorisch and Roxane Bertrand

There have been several attempts to annotate communicative

functions to utterances of verbal feedback in English previously.

Here, we suggest an annotation scheme for verbal and non-verbal

feedback utterances in French including the categories base,

attitude, previous and visual. The data comprises conversations,

maptasks and negotiations from which we extracted ca. 13,000

candidate feedback utterances and gestures. 12 students were

recruited for the annotation campaign of ca. 9,500 instances. Each

instance was annotated by between 2 and 7 raters. The evaluation

of the annotation agreement resulted in an average best-pair kappa

of 0.6. While the base category with the values acknowledgement,

evaluation, answer, elicit achieve good agreement, this is not the

case for the other main categories. The data sets, which also

include automatic extractions of lexical, positional and acoustic

features, are freely available and will further be used for machine

learning classification experiments to analyse the form-function

relationship of feedback.
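
One common way to read the best-pair kappa reported above is to compute Cohen's kappa for every pair of raters who labelled the same items and keep the best value. The sketch below shows the pairwise computation with scikit-learn; the raters and labels are invented, and the exact averaging used in the paper may differ.

from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Invented labels from three raters over the same ten feedback utterances.
ratings = {
    "r1": ["ack", "eval", "ack", "answer", "ack", "eval", "ack", "answer", "ack", "eval"],
    "r2": ["ack", "eval", "ack", "answer", "eval", "eval", "ack", "answer", "ack", "ack"],
    "r3": ["eval", "eval", "ack", "ack", "ack", "answer", "ack", "answer", "eval", "eval"],
}

# Cohen's kappa for every rater pair; "best-pair" keeps the maximum.
pair_kappas = {
    (a, b): cohen_kappa_score(ratings[a], ratings[b])
    for a, b in combinations(ratings, 2)
}
for pair, kappa in pair_kappas.items():
    print(pair, round(kappa, 2))
print("best-pair kappa:", round(max(pair_kappas.values()), 2))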

P41 - Language Learning
Thursday, May 26, 18:20

Chairperson: Costanza Navarretta Poster Session

Palabras: Crowdsourcing Transcriptions of L2 Speech

Eric Sanders, Pepi Burgos, Catia Cucchiarini and Roeland van Hout

We developed a web application for crowdsourcing transcriptions

of Dutch words spoken by Spanish L2 learners. In this paper we

discuss the design of the application and the influence of metadata

and various forms of feedback. Useful data were obtained from

159 participants, with an average of over 20 transcriptions per

item, which seems a satisfactory result for this type of research.

Informing participants about how many items they still had to

complete, and not how many they had already completed, turned

out to be an incentive to do more items. Assigning participants a

score for their performance made it more attractive for them to

carry out the transcription task, but this seemed to influence their

performance. We discuss possible advantages and disadvantages

in connection with the aim of the research and consider possible

lessons for designing future experiments.

The Uppsala Corpus of Student Writings: Corpus Creation, Annotation, and Analysis

Beata Megyesi, Jesper Näsman and Anne Palmér

The Uppsala Corpus of Student Writings consists of Swedish

texts produced as part of a national test of students ranging in

age from nine (in year three of primary school) to nineteen (the

last year of upper secondary school) who are studying either

Swedish or Swedish as a second language. National tests have

been collected since 1996. The corpus currently consists of 2,500

texts containing over 1.5 million tokens. Parts of the texts have

been annotated on several linguistic levels using existing state-of-

the-art natural language processing tools. In order to make the

corpus easy to interpret for scholars in the humanities, we chose

the CoNLL format instead of an XML-based representation. Since

spelling and grammatical errors are common in student writings,

the texts are automatically corrected while keeping the original

tokens in the corpus. Each token is annotated with part-of-speech

and morphological features as well as syntactic structure. The

main purpose of the corpus is to facilitate the systematic and

quantitative empirical study of the writings of various student

groups based on gender, geographic area, age, grade awarded or

a combination of these, synchronically or diachronically. The

intention is for this to be a monitor corpus, currently under

development.

Corpus for Children’s Writing with Enhanced Output for Specific Spelling Patterns (2nd and 3rd Grade)

Kay Berkling

This paper describes the collection of the H1 Corpus of children’s

weekly writing over the course of 3 months in 2nd and 3rd grades,

aged 7-11. The texts were collected within the normal classroom

setting by the teacher. Texts of children whose parents signed

the permission to donate the texts to science were collected and

transcribed. The corpus consists of the elicitation techniques, an

overview of the data collected and the transcriptions of the texts

both with and without spelling errors, aligned on a word by word

basis, as well as the scanned in texts. The corpus is available

for research via Linguistic Data Consortium (LDC). Researchers


are strongly encouraged to make additional annotations and

improvements and return it to the public domain via LDC.

The COPLE2 corpus: a learner corpus for Portuguese

Amália Mendes, Sandra Antunes, Maarten Janssen and Anabela Gonçalves

We present the COPLE2 corpus, a learner corpus of Portuguese

that includes written and spoken texts produced by learners

of Portuguese as a second or foreign language. The corpus

includes at the moment a total of 182,474 tokens and 978 texts,

classified according to the CEFR scales. The original handwritten

productions are transcribed in TEI compliant XML format and

keep record of all the original information, such as reformulations,

insertions and corrections made by the teacher, while the

recordings are transcribed and aligned with EXMARaLDA.

The TEITOK environment enables different views of the same

document (XML, student version, corrected version), a CQP-

based search interface, the POS, lemmatization and normalization

of the tokens, and will soon be used for error annotation in stand-

off format. The corpus has already been a source of data for

phonological, lexical and syntactic interlanguage studies and will

be used for a data-informed selection of language features for each

proficiency level.

French Learners Audio Corpus of German Speech (FLACGS)

Jane Wottawa and Martine Adda-Decker

The French Learners Audio Corpus of German Speech (FLACGS)

was created to compare German speech production of German

native speakers (GG) and French learners of German (FG)

across three speech production tasks of increasing production

complexity: repetition, reading and picture description. 40

speakers, 20 GG and 20 FG performed each of the three tasks,

which in total leads to approximately 7h of speech. The corpus

was manually transcribed and automatically aligned. Analyses

that can be performed on this type of corpus include, for instance,

segmental differences in the speech production of L2 learners

compared to native speakers. We chose the realization of the

velar nasal consonant engma. In spoken French, engma does not

appear in a VCV context which leads to production difficulties in

FG. With increasing speech production complexity (reading and

picture description), engma is realized as engma + plosive by FG

in over 50% of the cases. The results of a two way ANOVA with

unequal sample sizes on the durations of the different realizations

of engma indicate that duration is a reliable factor to distinguish

between engma and engma + plosive in FG productions compared

to the engma productions in GG in a VCV context. The FLACGS

corpus allows the study of L2 production and perception.

Croatian Error-Annotated Corpus of Non-Professional Written Language

Vanja Štefanec, Nikola Ljubešic and Jelena Kuvac Kraljevic

In this paper, the authors present the Croatian corpus of non-

professional written language. Consisting of two subcorpora,

i.e. the clinical subcorpus, consisting of written texts produced

by speakers with various types of language disorders, and the

healthy speakers subcorpus, together with the levels of its

annotation, it offers an opportunity for different lines of research.

The authors present the corpus structure, describe the sampling

methodology, explain the levels of annotation, and give some very

basic statistics. On the basis of data from the corpus, existing

language technologies for Croatian are adapted in order to be

implemented in a platform facilitating text production to speakers

with language disorders. In this respect, several analyses of the

corpus data and a basic evaluation of the developed technologies

are presented.

P42 - Less-Resourced Languages
Thursday, May 26, 18:20

Chairperson: Laurette Pretorius Poster Session

Training & Quality Assessment of an Optical Character Recognition Model for Northern Haida

Isabell Hubert, Antti Arppe, Jordan Lachler and Eddie Antonio Santos

We are presenting our work on the creation of the first optical

character recognition (OCR) model for Northern Haida, also

known as Masset or Xaad Kil, a nearly extinct First Nations

language spoken in the Haida Gwaii archipelago in British

Columbia, Canada. We are addressing the challenges of training

an OCR model for a language with an extensive, non-standard

Latin character set as follows: (1) We have compared various

training approaches and present the results of practical analyses

to maximize recognition accuracy and minimize manual labor.

An approach using just one or two pages of Source Images

directly performed better than the Image Generation approach,

and better than models based on three or more pages. Analyses

also suggest that a character’s frequency is directly correlated

with its recognition accuracy. (2) We present an overview of

current OCR accuracy analysis tools available. (3) We have ported

the once de-facto standardized OCR accuracy tools to be able

to cope with Unicode input. Our work adds to a growing body

of research on OCR for particularly challenging character sets,


and contributes to creating the largest electronic corpus for this

severely endangered language.

Legacy language atlas data mining: mapping Kru languages

Dafydd Gibbon

An online tool based on dialectometric methods, DistGraph, is

applied to a group of Kru languages of Côte d’Ivoire, Liberia

and Burkina Faso. The inputs to this resource consist of tables

of languages x linguistic features (e.g. phonological, lexical or

grammatical), and statistical and graphical outputs are generated

which show similarities and differences between the languages

in terms of the features as virtual distances. In the present

contribution, attention is focussed on the consonant systems of the

languages, a traditional starting point for language comparison.

The data are harvested from a legacy language data resource based

on fieldwork in the 1970s and 1980s, a language atlas of the

Kru languages. The method on which the online tool is based

extends beyond documentation of individual languages to the

documentation of language groups, and supports difference-based

prioritisation in education programmes, decisions on language

policy and documentation and conservation funding, as well as

research on language typology and heritage documentation of

history and migration.

Data Formats and Management Strategies from the Perspective of Language Resource Producers – Personal Diachronic and Social Synchronic Data Sharing –

Kazushi Ohya

This is a report of findings from on-going language documentation

research based on three consecutive projects from 2008 to 2016.

In the light of this research, we propose that (1) we should stand

on the side of language resource producers to enhance the research

of language processing. We support personal data management

in addition to social data sharing. (2) This support leads to

adopting simple data formats instead of the multi-link-path data

models proposed as international standards up to the present.

(3) We should set up a framework for total language resource

study that includes not only pivotal data formats such as standard

formats, but also the surroundings of data formation to capture

a wider range of language activities, e.g. annotation, hesitant

language formation, and reference-referent relations. A study of

this framework is expected to be a foundation of rebuilding man-

machine interface studies in which we seek to observe generative

processes of informational symbols in order to establish a high

affinity interface in regard to documentation.

Curation of Dutch Regional Dictionaries
Henk van den Heuvel, Eric Sanders and Nicoline van der Sijs

This paper describes the process of semi-automatically converting

dictionaries from paper to structured text (database) and the

integration of these into the CLARIN infrastructure in order

to make the dictionaries accessible and retrievable for the

research community. The case study at hand is that of the

curation of 42 fascicles of the Dictionaries of the Brabantic and

Limburgian dialects, and 6 fascicles of the Dictionary of dialects

in Gelderland.

Fostering digital representation of EU regional and minority languages: the Digital Language Diversity Project
Claudia Soria, Irene Russo, Valeria Quochi, Davyth Hicks, Antton Gurrutxaga, Anneli Sarhimaa and Matti Tuomisto

The poor digital representation of minority languages further limits

their usability on digital media and devices. The Digital Language

Diversity Project, a three-year project funded under the Erasmus+

programme, aims at addressing the problem of low digital

representation of EU regional and minority languages by giving

their speakers the intellectual and practical skills to create, share,

and reuse online digital content. Availability of digital content

and technical support to use it are essential prerequisites for the

development of language-based digital applications, which in turn

can boost digital usage of these languages. In this paper we

introduce the project, its aims, objectives and current activities for

sustaining digital usability of minority languages through adult

education.

Cysill Ar-lein: A Corpus of Written Contemporary Welsh Compiled from an On-line Spelling and Grammar Checker
Delyth Prys, Gruffudd Prys and Dewi Bryn Jones

This paper describes the use of a free, on-line language spelling

and grammar checking aid as a vehicle for the collection of

a significant (31 million words and rising) corpus of text for

academic research in the context of less resourced languages

where such data in sufficient quantities are often unavailable. It

describes two versions of the corpus: the texts as submitted,

prior to the correction process, and the texts following the user’s

incorporation of any suggested changes. An overview of the

corpus’ contents is given and an analysis of use including usage

statistics is also provided. Issues surrounding privacy and the

anonymization of data are explored as is the data’s potential use

for linguistic analysis, lexical research and language modelling.

The method used for gathering this corpus is believed to be


unique, and is a valuable addition to corpus studies in a minority

language.

ALT Explored: Integrating an Online Dialectometric Tool and an Online Dialect Atlas
Martijn Wieling, Eva Sassolini, Sebastiana Cucurullo and Simonetta Montemagni

In this paper, we illustrate the integration of an online

dialectometric tool, Gabmap, together with an online dialect

atlas, the Atlante Lessicale Toscano (ALT-Web). By using

a newly created url-based interface to Gabmap, ALT-Web is

able to take advantage of the sophisticated dialect visualization

and exploration options incorporated in Gabmap. For example,

distribution maps showing the distribution in the Tuscan dialect

area of a specific dialectal form (selected via the ALT-Web

website) are easily obtainable. Furthermore, the complete ALT-

Web dataset as well as subsets of the data (selected via the ALT-

Web website) can be automatically uploaded and explored in

Gabmap. By combining these two online applications, macro- and

micro-analyses of dialectal data (respectively offered by Gabmap

and ALT-Web) are effectively and dynamically combined.

LORELEI Language Packs: Data, Tools, and Resources for Technology Development in Low Resource Languages
Stephanie Strassel and Jennifer Tracey

In this paper, we describe the textual linguistic resources in nearly

3 dozen languages being produced by Linguistic Data Consortium

for DARPA’s LORELEI (Low Resource Languages for Emergent

Incidents) Program. The goal of LORELEI is to improve the

performance of human language technologies for low-resource

languages and enable rapid re-training of such technologies for

new languages, with a focus on the use case of deployment

of resources in sudden emergencies such as natural disasters.

Representative languages have been selected to provide broad

typological coverage for training, and surprise incident languages

for testing will be selected over the course of the program. Our

approach treats the full set of language packs as a coherent

whole, maintaining LORELEI-wide specifications, tagsets, and

guidelines, while allowing for adaptation to the specific needs

created by each language. Each representative language corpus,

therefore, both stands on its own as a resource for the specific

language and forms part of a large multilingual resource for

broader cross-language technology development.

A Computational Perspective on the Romanian Dialects
Alina Maria Ciobanu and Liviu P. Dinu

In this paper we conduct an initial study on the dialects of

Romanian. We analyze the differences between Romanian and its

dialects using the Swadesh list. We analyze the predictive power

of the orthographic and phonetic features of the words, building a

classification problem for dialect identification.

The Alaskan Athabascan Grammar Database
Sebastian Nordhoff, Siri Tuttle and Olga Lovick

This paper describes a repository of example sentences in three

endangered Athabascan languages: Koyukon, Upper Tanana,

Lower Tanana. The repository allows researchers or language

teachers to browse the example sentence corpus to either

investigate the languages or to prepare teaching materials. The

originally heterogeneous text collection was imported into a

SOLR store via the POIO bridge. This paper describes the

requirements, implementation, advantages and drawbacks of this

approach and discusses the potential to apply it for other languages

of the Athabascan family or beyond.

Constraint-Based Bilingual Lexicon Induction for Closely Related Languages
Arbi Haza Nasution, Yohei Murakami and Toru Ishida

The lack or absence of parallel and comparable corpora makes

bilingual lexicon extraction a difficult task for low-

resource languages. Pivot-language and cognate-recognition

approaches have proven useful for inducing bilingual lexicons

for such languages. We analyze the features of closely related

languages and define a semantic constraint assumption. Based

on the assumption, we propose a constraint-based bilingual

lexicon induction for closely related languages by extending

constraints and translation pair candidates from recent pivot

language approach. We further define three constraint sets

based on language characteristics. In this paper, two controlled

experiments are conducted. The former involves four closely

related language pairs with different language pair similarities,

and the latter focuses on sense connectivity between non-pivot

words and pivot words. We evaluate our result with F-measure.

The result indicates that our method works better on voluminous

input dictionaries and high similarity languages. Finally, we

introduce a strategy to use proper constraint sets for different goals

and language characteristics.
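
The pivot step that underlies this line of work can be shown in miniature: translation pairs (a, p) from language A to the pivot and (p, b) from the pivot to language B are composed into candidate pairs (a, b), which the constraints then filter. The seed dictionaries below are invented placeholders, and the constraint filtering itself is only indicated in a comment.

from collections import defaultdict

# Invented seed dictionaries through a pivot language.
a_to_pivot = {"rumah": ["house"], "air": ["water"], "jalan": ["road", "walk"]}
pivot_to_b = {"house": ["balay"], "water": ["tubig"], "road": ["dalan"]}

# Compose A->pivot and pivot->B entries into candidate A->B translation pairs.
candidates = defaultdict(set)
for a_word, pivot_words in a_to_pivot.items():
    for pivot_word in pivot_words:
        for b_word in pivot_to_b.get(pivot_word, []):
            candidates[a_word].add(b_word)

print(dict(candidates))
# A constraint-based method would now keep or discard candidates using, for example,
# cognate similarity between the two closely related languages.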

P43 - Named Entity Recognition
Thursday, May 26, 18:20

Chairperson: Sara Tonelli Poster Session

WTF-LOD - A New Resource for Large-Scale NER Evaluation
Lubomir Otrusina and Pavel Smrz

This paper introduces the Web TextFull linkage to Linked Open

Data (WTF-LOD) dataset intended for large-scale evaluation of


named entity recognition (NER) systems. First, we present the

process of collecting data from the largest publicly available

textual corpora, including Wikipedia dumps, monthly runs of

the CommonCrawl, and ClueWeb09/12. We discuss similarities

and differences of related initiatives such as WikiLinks and

WikiReverse. Our work primarily focuses on links from “textfull”

documents (links surrounded by a text that provides a useful

context for entity linking), de-duplication of the data and advanced

cleaning procedures. The presented statistics demonstrate that the

collected data forms one of the largest available resource of

its kind. They also prove the suitability of the result for complex

NER evaluation campaigns, including an analysis of the most

ambiguous name mentions appearing in the data.

Using a Language Technology Infrastructure for German in order to Anonymize German Sign Language Corpus Data

Julian Bleicken, Thomas Hanke, Uta Salden and Sven Wagner

For publishing sign language corpus data on the web,

anonymization is crucial even if it is impossible to hide the

visual appearance of the signers: In a small community, even

vague references to third persons may be enough to identify those

persons. In the case of the DGS Korpus (German Sign Language

corpus) project, we want to publish data as a contribution to the

cultural heritage of the sign language community while annotation

of the data is still ongoing. This poses the question of how

well anonymization can be achieved given that no full linguistic

analysis of the data is available. Basically, we combine analysis

of all data that we have, including named entity recognition

on translations into German. For this, we use the WebLicht

language technology infrastructure. We report on the reliability

of these methods in this special context and also illustrate how the

anonymization of the video data is technically achieved in order

to minimally disturb the viewer.

Crowdsourced Corpus with Entity Salience Annotations

Milan Dojchinovski, Dinesh Reddy, Tomas Kliegr, Tomas Vitvar and Harald Sack

In this paper, we present a crowdsourced dataset which adds entity

salience (importance) annotations to the Reuters-128 dataset,

which is subset of Reuters-21578. The dataset is distributed under

a free license and published in the NLP Interchange Format, which

fosters interoperability and re-use. We show the potential of the

dataset on the task of learning an entity salience classifier and

report on the results from several experiments.

ELMD: An Automatically Generated Entity Linking Gold Standard Dataset in the Music Domain

Sergio Oramas, Luis Espinosa Anke, Mohamed Sordo, Horacio Saggion and Xavier Serra

In this paper we present a gold standard dataset for Entity Linking

(EL) in the Music Domain. It contains thousands of musical

named entities such as Artist, Song or Record Label, which

have been automatically annotated on a set of artist biographies

coming from the music website and social network Last.fm. The

annotation process relies on the analysis of the hyperlinks present

in the source texts and on a voting-based algorithm for EL, which

considers, for each entity mention in text, the degree of agreement

across three state-of-the-art EL systems. Manual evaluation shows

that EL Precision is at least 94%, and due to its tunable nature,

it is possible to derive annotations favouring higher Precision or

Recall, at will. We make available the annotated dataset along

with evaluation data and the code.
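
The voting idea, accepting an entity link when enough independent EL systems agree on it, can be sketched in a few lines. The mention, candidate URIs and the agreement threshold are invented for illustration and do not reproduce the paper's tunable algorithm.

from collections import Counter

# Hypothetical outputs of three EL systems for one mention in an artist biography.
system_links = {
    "systemA": "dbpedia:The_Beatles",
    "systemB": "dbpedia:The_Beatles",
    "systemC": "dbpedia:Beatles_(album)",
}

def vote(links, min_agreement=2):
    """Keep the most frequent candidate link if enough systems agree on it."""
    counts = Counter(links.values())
    candidate, support = counts.most_common(1)[0]
    return candidate if support >= min_agreement else None

print(vote(system_links))  # dbpedia:The_Beatles (2 of 3 systems agree)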

Bridge-Language Capitalization Inference in Western Iranian: Sorani, Kurmanji, Zazaki, and Tajik

Patrick Littell, David R. Mortensen, Kartik Goyal, Chris Dyer and Lori Levin

In Sorani Kurdish, one of the most useful orthographic features

in named-entity recognition – capitalization – is absent, as

the language’s Perso-Arabic script does not make a distinction

between uppercase and lowercase letters. We describe a system

for deriving an inferred capitalization value from closely related

languages by phonological similarity, and illustrate the system

using several related Western Iranian languages.

Annotating Named Entities in Consumer Health Questions

Halil Kilicoglu, Asma Ben Abacha, Yassine Mrabet, Kirk Roberts, Laritza Rodriguez, Sonya Shooshan and Dina Demner-Fushman

We describe a corpus of consumer health questions annotated

with named entities. The corpus consists of 1548 de-identified

questions about diseases and drugs, written in English. We

defined 15 broad categories of biomedical named entities for

annotation. A pilot annotation phase in which a small portion of

the corpus was double-annotated by four annotators was followed

by a main phase in which double annotation was carried out by

six annotators, and a reconciliation phase in which all annotations

were reconciled by an expert. We conducted the annotation in

two modes, manual and assisted, to assess the effect of automatic

pre-annotation and calculated inter-annotator agreement. We

obtained moderate inter-annotator agreement; assisted annotation

yielded slightly better agreement and fewer missed annotations

than manual annotation. Due to the complex nature of biomedical

entities, we paid particular attention to nested entities for which

we obtained slightly lower inter-annotator agreement, confirming

that annotating nested entities is somewhat more challenging. To

our knowledge, the corpus is the first of its kind for consumer

health text and is publicly available.

A Regional News Corpora for Contextualized Entity Discovery and Linking

Adrian Brasoveanu, Lyndon J.B. Nixon, Albert Weichselbraun and Arno Scharl

This paper presents a German corpus for Named Entity Linking

(NEL) and Knowledge Base Population (KBP) tasks. We describe

the annotation guideline, the annotation process, NIL clustering

techniques and conversion to popular NEL formats such as NIF

and TAC that have been used to construct this corpus based

on news transcripts from the German regional broadcaster RBB

(Rundfunk Berlin Brandenburg). Since creating such language

resources requires significant effort, the paper also discusses how

to derive additional evaluation resources for tasks like named

entity contextualization or ontology enrichment by exploiting the

links between named entities from the annotated corpus. The

paper concludes with an evaluation that shows how several well-

known NEL tools perform on the corpus, a discussion of the

evaluation results, and with suggestions on how to keep evaluation

corpora and datasets up to date.

DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus

Martin Brümmer, Milan Dojchinovski and Sebastian Hellmann

The ever increasing importance of machine learning in Natural

Language Processing is accompanied by an equally increasing

need for large-scale training and evaluation corpora. Due to its

size, its openness and relative quality, Wikipedia has already

been a source of such data, but on a limited scale. This paper

introduces the DBpedia Abstract Corpus, a large-scale, open

corpus of annotated Wikipedia texts in six languages, featuring

over 11 million texts and over 97 million entity links. The

properties of the Wikipedia texts are described, as well as

the corpus creation process, its format and interesting use-cases,

like Named Entity Linking training and evaluation.

Government Domain Named Entity Recognition for South African Languages

Roald Eiselen

This paper describes the named entity language resources

developed as part of a development project for the South African

languages. The development efforts focused on creating protocols

and annotated data sets with at least 15,000 annotated named

entity tokens for ten of the official South African languages. The

description of the protocols and annotated data sets provides an

overview of the problems encountered during the annotation of

the data sets. Based on these annotated data sets, CRF named

entity recognition systems are developed that leverage existing

linguistic resources. The newly created named entity recognisers

are evaluated, with F-scores of between 0.64 and 0.77, and error

analysis is performed to identify possible avenues for improving

the quality of the systems.
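
For readers unfamiliar with CRF-based NER, a minimal sketch of the general setup follows; it uses the third-party sklearn-crfsuite package with invented toy features and data, and is not the systems or feature set described in the paper:

```python
# Minimal linear-chain CRF NER sketch (illustrative; not the systems or feature
# set described in the paper). Uses the third-party sklearn-crfsuite package.
import sklearn_crfsuite

def token_features(sentence, i):
    token = sentence[i]
    return {
        "lower": token.lower(),
        "is_title": token.istitle(),   # capitalization is a strong NE cue
        "is_digit": token.isdigit(),
        "suffix3": token[-3:],
        "prev": sentence[i - 1].lower() if i > 0 else "<BOS>",
        "next": sentence[i + 1].lower() if i < len(sentence) - 1 else "<EOS>",
    }

def sent2features(sentence):
    return [token_features(sentence, i) for i in range(len(sentence))]

# Toy training data with BIO labels (invented example).
train_sents = [(["uMongameli", "Zuma", "uvakashele", "iPitoli"],
                ["O", "B-PER", "O", "B-LOC"])]
X_train = [sent2features(s) for s, _ in train_sents]
y_train = [labels for _, labels in train_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict([sent2features(["uMongameli", "Zuma"])]))
```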

Named Entity Resources - Overview and Outlook

Maud Ehrmann, Damien Nouvel and Sophie Rosset

Recognition of real-world entities is crucial for most NLP

applications. Since its introduction some twenty years ago, named

entity processing has undergone a significant evolution with,

among others, the definition of new tasks (e.g. entity linking) and

the emergence of new types of data (e.g. speech transcriptions,

micro-blogging). These certainly pose new challenges, which

affect not only methods and algorithms but especially linguistic

resources. Where do we stand with respect to named entity

resources? This paper aims at providing a systematic overview

of named entity resources, accounting for qualities such as

multilingualism, dynamicity and interoperability, and at identifying

shortfalls in order to guide future developments.

Incorporating Lexico-semantic Heuristics into Coreference Resolution Sieves for Named Entity Recognition at Document-level

Marcos Garcia

This paper explores the incorporation of lexico-semantic

heuristics into a deterministic Coreference Resolution (CR)

system for classifying named entities at document-level. The

highest-precision sieves of a CR tool are enriched both with a set

of heuristics for merging named entities labeled with different

classes and with constraints that avoid the incorrect

merging of similar mentions. Several tests show that this strategy

improves both NER labeling and CR. The CR tool can be applied

in combination with any system for named entity recognition

using the CoNLL format, and brings benefits to text analytics tasks

such as Information Extraction. Experiments were carried out in

Spanish, using three different NER tools.

Using Word Embeddings to Translate Named Entities

Octavia-Maria Sulea, Sergiu Nisioi and Liviu P. Dinu

In this paper we investigate the usefulness of neural word

embeddings in the process of translating Named Entities (NEs)

from a resource-rich language to a language low on resources

relevant to the task at hand, introducing a novel, yet simple way

of obtaining bilingual word vectors. Inspired by observations

in (Mikolov et al., 2013b), which show that training their word

vector model on comparable corpora yields comparable vector

space representations of those corpora, reducing the problem

of translating words to finding a rotation matrix, and results in

(Zou et al., 2013), which showed that bilingual word embeddings

can improve Chinese Named Entity Recognition (NER) and

English to Chinese phrase translation, we use the sentence-aligned

English-French EuroParl corpora and show that word embeddings

extracted from a merged corpus (corpus resulted from the merger

of the two aligned corpora) can be used for NE translation. We

extrapolate that word embeddings trained on merged parallel

corpora are useful in Named Entity Recognition and Translation

tasks for resource-poor languages.
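
The linear-map view of bilingual embeddings referenced above (Mikolov et al., 2013b) can be sketched in a few lines; the vectors and seed dictionary below are toy values, not trained embeddings, and this is not the authors' pipeline:

```python
# Sketch of the "translation as a linear map" idea (Mikolov et al., 2013b):
# fit a matrix W on a seed dictionary so that source vectors map onto target
# vectors, then translate by nearest neighbour in the target space.
# The vectors and vocabulary below are toy values, not trained embeddings.
import numpy as np

src = {"london": np.array([1.0, 0.0]), "paris": np.array([0.0, 1.0])}
tgt = {"londres": np.array([0.9, 0.1]), "paris": np.array([0.1, 0.9])}

seed = [("london", "londres"), ("paris", "paris")]   # seed dictionary
X = np.stack([src[s] for s, _ in seed])              # source matrix
Y = np.stack([tgt[t] for _, t in seed])              # target matrix
W, *_ = np.linalg.lstsq(X, Y, rcond=None)            # least-squares map: X @ W ~ Y

def translate(word):
    v = src[word] @ W
    # Cosine similarity against every target word; return the best match.
    return max(tgt, key=lambda w: float(v @ tgt[w]) /
               (np.linalg.norm(v) * np.linalg.norm(tgt[w])))

print(translate("london"))  # expected: "londres"
```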

O33 - Textual Entailment

Thursday, May 26, 18:20

Chairperson: Lucia Specia (Oral Session)

TEG-REP: A corpus of Textual Entailment Graphs based on Relation Extraction Patterns

Kathrin Eichler, Feiyu Xu, Hans Uszkoreit, Leonhard Hennig and Sebastian Krause

The task of relation extraction is to recognize and extract

relations between entities or concepts in texts. Dependency parse

trees have become a popular source for discovering extraction

patterns, which encode the grammatical relations among the

phrases that jointly express relation instances. State-of-the-art

weakly supervised approaches to relation extraction typically

extract thousands of unique patterns only potentially expressing

the target relation. Among these patterns, some are semantically

equivalent, but differ in their morphological, lexical-semantic or

syntactic form. Some express a relation that entails the target

relation. We propose a new approach to structuring extraction

patterns by utilizing entailment graphs, hierarchical structures

representing entailment relations, and present a novel resource

of gold-standard entailment graphs based on a set of patterns

automatically acquired using distant supervision. We describe the

methodology used for creating the dataset and present statistics of

the resource as well as an analysis of inference types underlying

the entailment decisions.

Passing a USA National Bar Exam: a First Corpus for Experimentation

Biralatei Fawei, Adam Wyner and Jeff Pan

Bar exams provide a key watershed by which legal professionals

demonstrate their knowledge of the law and its application.

Passing the bar entitles one to practice the law in a given

jurisdiction. The bar provides an excellent benchmark for the

performance of legal information systems since passing the bar

would arguably signal that the system has acquired key aspects

of legal reasoning on a par with a human lawyer. The paper

provides a corpus and experimental results with material derived

from a real bar exam, treating the problem as a form of textual

entailment from the question to an answer. The providers of

the bar exam material set the Gold Standard, which is the

answer key. The experiments were carried out using the ‘out of the

box’ Excitement Open Platform for textual entailment. The

results and evaluation show that the tool can identify wrong

answers (non-entailment) with a high F1 score, but it performs

poorly in identifying the correct answer (entailment). The results

provide a baseline performance measure against which to evaluate

future improvements. The reasons for the poor performance

are examined, and proposals are made to augment the tool

in the future. The corpus facilitates experimentation by other

researchers.

Corpora for Learning the Mutual Relationship between Semantic Relatedness and Textual Entailment

Ngoc Phuoc An Vo and Octavian Popescu

In this paper we present the creation of a corpus annotated with

both semantic relatedness (SR) scores and textual entailment (TE)

judgments. In building this corpus we aimed at discovering, if any,

the relationship between these two tasks for the mutual benefit

of resolving one of them by relying on the insights gained from

the other. We considered corpora already annotated with TE

judgments and proceeded to the manual annotation with SR

scores. The RTE 1-4 corpora used in the PASCAL competition

fit our need. The annotators worked independently of each

other and they did not have access to the TE judgment during

annotation. The intuition that the two annotations are correlated

received major support from this experiment and this finding led

to a system that uses this information to revise the initial estimates

of SR scores. As semantic relatedness is one of the most general

and difficult tasks in natural language processing, we expect that

future systems will combine different sources of information in

order to solve it. Our work suggests that textual entailment plays

a quantifiable role in addressing it.

O34 - Document Classification, Text Categorisation and Topic Detection

Thursday, May 26, 18:20

Chairperson: Iryna Gurevych (Oral Session)

Can Topic Modelling benefit from Word Sense Information?

Adriana Ferrugento, Hugo Gonçalo Oliveira, Ana Alves and Filipe Rodrigues

This paper proposes a new topic model that exploits word

sense information in order to discover less redundant and more

informative topics. Word sense information is obtained from

WordNet and the discovered topics are groups of synsets, instead

of mere surface words. A key feature is that all the known senses

of a word are considered, with their probabilities. Alternative

configurations of the model are described and compared to each

other and to LDA, the most popular topic model. However, the

obtained results suggest that there are no benefits of enriching

LDA with word sense information.

Age and Gender Prediction on Health Forum Data

Prasha Shrestha, Nicolas Rey-Villamizar, Farig Sadeque, Ted Pedersen, Steven Bethard and Thamar Solorio

Health support forums have become a rich source of data that can

be used to improve health care outcomes. A user profile, including

information such as age and gender, can support targeted analysis

of forum data. But users might not always disclose their age and

gender. It is desirable then to be able to automatically extract

this information from users’ content. However, to the best of our

knowledge there is no such resource for author profiling of health

forum data. Here we present a large corpus, with close to 85,000

users, for profiling and also outline our approach and benchmark

results to automatically detect a user’s age and gender from their

forum posts. We use a mix of features from a user’s text as well as

forum specific features to obtain accuracy well above the baseline,

thus showing that both our dataset and our method are useful and

valid.

Comparing Speech and Text Classification on ICNALE

Sergiu Nisioi

In this paper we explore and compare a speech and text

classification approach on a corpus of native and non-native

English speakers. We experiment on a subset of the International

Corpus Network of Asian Learners of English containing the

recorded speeches and the equivalent text transcriptions. Our

results suggest a high correlation between the spoken and

written classification results, showing that native accent is highly

correlated with grammatical structures found in text.

O35 - Detecting Information in Medical Domain

Thursday, May 26, 18:20

Chairperson: Dimitrios Kokkinakis (Oral Session)

Monitoring Disease Outbreak Events on the Web Using Text-mining Approach and Domain Expert Knowledge

Elena Arsevska, Mathieu Roche, Sylvain Falala, Renaud Lancelot, David Chavernac, Pascal Hendrikx and Barbara Dufour

Timeliness and precision in the detection of infectious animal disease

outbreaks from information published on the web are crucial

for preventing their spread. We propose a generic method

to enrich and extend the use of different expressions as queries

in order to improve the acquisition of relevant disease related

pages on the web. Our method combines a text mining approach

to extract terms from corpora of relevant disease outbreak

documents, and domain expert elicitation (Delphi method) to

propose expressions and to select relevant combinations between

terms obtained with text mining. In this paper we evaluated the

performance as queries of a number of expressions obtained with

text mining and validated by a domain expert and expressions

proposed by a panel of 21 domain experts. We used African swine

fever as an infectious animal disease model. The expressions

obtained with text mining outperformed as queries the expressions

proposed by domain experts. However, domain experts proposed

expressions not extracted automatically. Our method is simple

to conduct and flexible to adapt to any other infectious animal

disease, and even to the public health domain.

On Developing Resources for Patient-level Information Retrieval

Stephen Wu, Tamara Timmons, Amy Yates, Meikun Wang, Steven Bedrick, William Hersh and Hongfang Liu

Privacy concerns have often served as an insurmountable

barrier for the production of research and resources in clinical

information retrieval (IR). We believe that both clinical IR

research innovation and legitimate privacy concerns can be served

by the creation of intra-institutional, fully protected resources.

In this paper, we provide some principles and tools for IR

resource-building in the unique problem setting of patient-level

IR, following the tradition of the Cranfield paradigm.

Annotating and Detecting Medical Events in Clinical Notes

Prescott Klassen, Fei Xia and Meliha Yetisgen

Early detection and treatment of diseases that onset after a patient

is admitted to a hospital, such as pneumonia, is critical to

improving outcomes and reducing costs in healthcare. Previous studies

(Tepper et al., 2013) showed that change-of-state events in

clinical notes could be important cues for phenotype detection.

In this paper, we extend the annotation schema proposed in

(Klassen et al., 2014) to mark change-of-state events, diagnosis

events, coordination, and negation. After we have completed

the annotation, we build NLP systems to automatically identify

named entities and medical events, which yield f-scores of

94.7% and 91.8%, respectively.

O36 - Speech Synthesis

Thursday, May 26, 18:20

Chairperson: Diana Santos (Oral Session)

Speech Synthesis of Code-Mixed Text

Sunayana Sitaram and Alan W Black

Most Text to Speech (TTS) systems today assume that the input

text is in a single language and is written in the same language

that the text needs to be synthesized in. However, in bilingual and

multilingual communities, code mixing or code switching occurs

in speech, in which speakers switch between languages in the

same utterance. Due to the popularity of social media, we now

see code-mixing even in text in these multilingual communities.

TTS systems capable of synthesizing such text need to be able

to handle text that is written in multiple languages and scripts.

Code-mixed text poses many challenges to TTS systems, such as

language identification, spelling normalization and pronunciation

modeling. In this work, we describe a preliminary framework

for synthesizing code-mixed text. We carry out experiments on

synthesizing code-mixed Hindi and English text. We find that

there is a significant user preference for TTS systems that can

correctly identify and pronounce words in different languages.

Chatbot Technology with Synthetic Voices in the Acquisition of an Endangered Language: Motivation, Development and Evaluation of a Platform for Irish

Neasa Ní Chiaráin and Ailbhe Ní Chasaide

This paper describes the development and evaluation of a chatbot

platform designed for the teaching/learning of Irish. The chatbot

uses synthetic voices developed for the dialects of Irish. Speech-

enabled chatbot technology offers a potentially powerful tool for

dealing with the challenges of teaching/learning an endangered

language where learners have limited access to native speaker

models of the language and limited exposure to the language in

a truly communicative setting. The sociolinguistic context that

motivates the present development is explained. The evaluation

of the chatbot was carried out in 13 schools by 228 pupils

and consisted of two parts. Firstly, learners’ opinions of the

overall chatbot platform as a learning environment were elicited.

Secondly, learners evaluated the intelligibility, quality, and

attractiveness of the synthetic voices used in this platform. Results

were overwhelmingly positive towards both the learning platform and

the synthetic voices and indicate that the time may now be ripe for

language learning applications which exploit speech and language

technologies. It is further argued that these technologies have a

particularly vital role to play in the maintenance of the endangered

language.

CHATR the Corpus; a 20-year-old archive of Concatenative Speech Synthesis

Nick Campbell

This paper reports the preservation of an old speech synthesis

website as a corpus. CHATR was a revolutionary technique

developed in the mid nineties for concatenative speech synthesis.

The method has since become the standard for high quality speech

output by computer although much of the current research is

devoted to parametric or hybrid methods that employ smaller

amounts of data and can be more easily tuned to individual

voices. The system was first reported in 1994 and the website

was functional in 1996. The ATR labs where this system was

invented no longer exist, but the website has been preserved as

a corpus containing 1537 samples of synthesised speech from

that period (118 MB in aiff format) in 211 pages under various

finely interrelated themes. The corpus can be accessed from

www.speech-data.jp as well as www.tcd-fastnet.com, where the

original code and samples are now being maintained.

O37 - Robots and Conversational Agents Interaction

Friday, May 27, 9:45

Chairperson: Claude Barras (Oral Session)

How to Address Smart Homes with a Social Robot? A Multi-modal Corpus of User Interactions with an Intelligent Environment

Patrick Holthaus, Christian Leichsenring, Jasmin Bernotat, Viktor Richter, Marian Pohling, Birte Carlmeyer, Norman Köster, Sebastian Meyer zu Borgsen, René Zorn, Birte Schiffhauer, Kai Frederic Engelmann, Florian Lier, Simon Schulz, Philipp Cimiano, Friederike Eyssel, Thomas Hermann, Franz Kummert, David Schlangen, Sven Wachsmuth, Petra Wagner, Britta Wrede and Sebastian Wrede

In order to explore intuitive verbal and non-verbal interfaces

in smart environments we recorded user interactions with an

intelligent apartment. Besides offering various interactive

capabilities itself, the apartment is also inhabited by a social robot

that is available as a humanoid interface. This paper presents a

multi-modal corpus that contains goal-directed actions of naive

users in attempts to solve a number of predefined tasks. Alongside

audio and video recordings, our data-set consists of a large amount

of temporally aligned sensory data and system behavior provided

by the environment and its interactive components. Non-verbal

system responses such as changes in light or display contents, as

well as robot and apartment utterances and gestures serve as a

rich basis for later in-depth analysis. Manual annotations provide

further information about meta data like the current course of

study and user behavior including the incorporated modality, all

literal utterances, language features, emotional expressions, foci

of attention, and addressees.

A Corpus of Gesture-Annotated Dialogues for Monologue-to-Dialogue Generation from Personal Narratives

Zhichao Hu, Michelle Dick, Chung-Ning Chang, Kevin Bowden, Michael Neff, Jean Fox Tree and Marilyn Walker

Story-telling is a fundamental and prevalent aspect of human

social behavior. In the wild, stories are told conversationally

in social settings, often as a dialogue and with accompanying

gestures and other nonverbal behavior. This paper presents

a new corpus, the Story Dialogue with Gestures (SDG)

corpus, consisting of 50 personal narratives regenerated as

dialogues, complete with annotations of gesture placement and

accompanying gesture forms. The corpus includes dialogues

generated by human annotators, gesture annotations on the human

generated dialogues, videos of story dialogues generated from this

representation, video clips of each gesture used in the gesture

annotations, and annotations of the original personal narratives

with a deep representation of story called a Story Intention Graph.

Our long term goal is the automatic generation of story co-tellings

as animated dialogues from the Story Intention Graph. We expect

this corpus to be a useful resource for researchers interested in

natural language generation, intelligent virtual agents, generation

of nonverbal behavior, and story and narrative representations.

Multimodal Resources for Human-Robot Communication Modelling

Stavroula-Evita Fotinea, Eleni Efthimiou, Maria Koutsombogera, Athanasia-Lida Dimou, Theodore Goulas and Kyriaki Vasilaki

This paper reports on work related to the modelling of

Human-Robot Communication on the basis of multimodal and

multisensory human behaviour analysis. A primary focus

in this framework of analysis is the definition of semantics

of human actions in interaction, their capture and their

representation in terms of behavioural patterns that, in turn, feed

a multimodal human-robot communication system. Semantic

analysis encompasses both oral and sign languages, as well as

both verbal and non-verbal communicative signals to achieve an

effective, natural interaction between elderly users with slight

walking and cognitive inability and an assistive robotic platform.

A Verbal and Gestural Corpus of Story Retellings to an Expressive Embodied Virtual Character

Jackson Tolins, Kris Liu, Michael Neff, Marilyn Walker and Jean Fox Tree

We present a corpus of 44 human-agent verbal and gestural story

retellings designed to explore whether humans would gesturally

entrain to an embodied intelligent virtual agent. We used a

novel data collection method where an agent presented story

components in installments, which the human would then retell

to the agent. At the end of the installments, the human would then

retell the embodied animated agent the story as a whole. This

method was designed to allow us to observe whether changes

in the agent’s gestural behavior would result in human gestural

changes. The agent modified its gestures over the course of the

story, by starting out the first installment with gestural behaviors

designed to manifest extraversion, and slowly modifying gestures

to express introversion over time, or the reverse. The corpus

contains the verbal and gestural transcripts of the human story

retellings. The gestures were coded for type, handedness,

temporal structure, spatial extent, and the degree to which the

participants’ gestures match those produced by the agent. The

corpus illustrates the variation in expressive behaviors produced

by users interacting with embodied virtual characters, and the

degree to which their gestures were influenced by the agent’s

dynamic changes in personality-based expressive style.

A Multimodal Motion-Captured Corpus of Matched and Mismatched Extravert-Introvert Conversational Pairs

Jackson Tolins, Kris Liu, Yingying Wang, Jean Fox Tree, Marilyn Walker and Michael Neff

This paper presents a new corpus, the Personality Dyads Corpus,

consisting of multimodal data for three conversations between

three personality-matched, two-person dyads (a total of 9 separate

dialogues). Participants were selected from a larger sample to

be 0.8 of a standard deviation above or below the mean on the

Big-Five Personality extraversion scale, to produce an Extravert-

Extravert dyad, an Introvert-Introvert dyad, and an Extravert-

Introvert dyad. Each pair carried out conversations for three

different tasks. The conversations were recorded using optical

motion capture for the body and data gloves for the hands. Dyads’

speech was transcribed and the gestural and postural behavior was

annotated with ANVIL. The released corpus includes personality

profiles, ANVIL files containing speech transcriptions and the

gestural annotations, and BVH files containing body and hand

motion in 3D.

O38 - Crowdsourcing

Friday, May 27, 9:45

Chairperson: Andrejs Vasiljevs (Oral Session)

Crowdsourcing Ontology Lexicons

Bettina Lanser, Christina Unger and Philipp Cimiano

In order to make the growing amount of conceptual knowledge

available through ontologies and datasets accessible to humans,

NLP applications need access to information on how this

knowledge can be verbalized in natural language. One way to

provide this kind of information are ontology lexicons, which

apart from the actual verbalizations in a given target language can

provide further, rich linguistic information about them. Compiling

such lexicons manually is a very time-consuming task and

requires expertise both in Semantic Web technologies and lexicon

engineering, as well as a very good knowledge of the target

language at hand. In this paper we present an alternative approach

to generating ontology lexicons by means of crowdsourcing: We

use CrowdFlower to generate a small Japanese ontology lexicon

for ten exemplary ontology elements from the DBpedia ontology

according to a two-stage workflow, the main underlying idea of

which is to turn the task of generating lexicon entries into a

translation task; the starting point of this translation task is a

manually created English lexicon for DBpedia. Comparison of

the results to a manually created Japanese lexicon shows that the

presented workflow is a viable option if an English seed lexicon is

already available.

InScript: Narrative texts annotated with script information

Ashutosh Modi, Tatjana Anikina, Simon Ostermann and Manfred Pinkal

This paper presents the InScript corpus (Narrative Texts

Instantiating Script structure). InScript is a corpus of 1,000 stories

centered around 10 different scenarios. Verbs and noun phrases

are annotated with event and participant types, respectively.

Additionally, the text is annotated with coreference information.

The corpus shows rich lexical variation and will serve as a unique

resource for the study of the role of script knowledge in natural

language processing.

A Crowdsourced Database of Event Sequence Descriptions for the Acquisition of High-quality Script Knowledge

Lilian D. A. Wanzare, Alessandra Zarcone, Stefan Thater and Manfred Pinkal

Scripts are standardized event sequences describing typical

everyday activities, which play an important role in the

computational modeling of cognitive abilities (in particular

for natural language processing). We present a large-scale

crowdsourced collection of explicit linguistic descriptions of

script-specific event sequences (40 scenarios with 100 sequences

each). The corpus is enriched with crowdsourced alignment

annotation on a subset of the event descriptions, to be used

in future work as seed data for automatic alignment of event

descriptions (for example via clustering). The event descriptions

to be aligned were chosen among those expected to have the

strongest corrective effect on the clustering algorithm. The

alignment annotation was evaluated against a gold standard of

expert annotators. The resulting database of partially-aligned

script-event descriptions provides a sound empirical basis for

inducing high-quality script knowledge, as well as for any task

involving alignment and paraphrase detection of events.

Temporal Information Annotation: Crowd vs. Experts

Tommaso Caselli, Rachele Sprugnoli and Oana Inel

This paper describes two sets of crowdsourcing experiments on

temporal information annotation conducted on two languages,

i.e., English and Italian. The first experiment, launched on

the CrowdFlower platform, was aimed at classifying temporal

relations given target entities. The second one, relying on the

CrowdTruth metric, consisted of two subtasks: one devoted to

the recognition of events and temporal expressions and one to the

detection and classification of temporal relations. The outcomes

of the experiments suggest a valuable use of crowdsourcing

annotations also for a complex task like Temporal Processing.

A Tangled Web: The Faint Signals of Deception in Text - Boulder Lies and Truth Corpus (BLT-C)

Franco Salvetti, John B. Lowe and James H. Martin

We present an approach to creating corpora for use in detecting

deception in text, including a discussion of the challenges peculiar

to this task. Our approach is based on soliciting several types

of reviews from writers and was implemented using Amazon

Mechanical Turk. We describe the multi-dimensional corpus

of reviews built using this approach, available free of charge

from LDC as the Boulder Lies and Truth Corpus (BLT-C).

Challenges for both corpus creation and deception detection

include the fact that human performance on the task is typically

at chance, that the signal is faint, that paid writers such as

turkers are sometimes deceptive, and that deception is a complex

human behavior; manifestations of deception depend on details of

domain, intrinsic properties of the deceiver (such as education,

linguistic competence, and the nature of the intention), and

specifics of the deceptive act (e.g., lying vs. fabricating). To

overcome the inherent lack of ground truth, we have developed

a set of semi-automatic techniques to ensure corpus validity. We

present some preliminary results on the task of deception detection

which suggest that the BLT-C is an improvement in the quality of

resources available for this task.

O39 - Corpora for Machine Translation

Friday, May 27, 9:45

Chairperson: Christopher Cieri (Oral Session)

Finding Alternative Translations in a Large Corpus of Movie Subtitles

Jörg Tiedemann

OpenSubtitles.org provides a large collection of user contributed

subtitles in various languages for movies and TV programs.

Subtitle translations are valuable resources for cross-lingual

studies and machine translation research. A less explored feature

of the collection is the inclusion of alternative translations, which

can be very useful for training paraphrase systems or collecting

multi-reference test suites for machine translation. However,

differences in translation may also be due to misspellings,

incomplete or corrupt data files, or wrongly aligned subtitles. This

paper reports our efforts in recognising and classifying alternative

subtitle translations with language independent techniques.

We use time-based alignment with lexical re-synchronisation

techniques and BLEU score filters and sort alternative translations

into categories using edit distance metrics and heuristic rules. Our

approach produces large numbers of sentence-aligned translation

alternatives for over 50 languages provided via the OPUS corpus

collection.
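
A toy sketch of the sorting step, using a character-level similarity ratio as a stand-in for the paper's edit distance metrics and BLEU filters (the category names and thresholds are invented):

```python
# Toy sketch of sorting alternative subtitle translations into rough categories
# with a character-level similarity heuristic; difflib's ratio stands in for
# edit distance, and the thresholds are invented, not the paper's settings.
import difflib

def classify_pair(a, b, near=0.95, related=0.5):
    ratio = difflib.SequenceMatcher(None, a, b).ratio()
    if ratio >= near:
        return "minor variant"
    if ratio >= related:
        return "alternative translation"
    return "possible misalignment or corrupt data"

pairs = [
    ("I'll see you tomorrow.", "I will see you tomorrow."),
    ("I'll see you tomorrow.", "See you in the morning."),
    ("I'll see you tomorrow.", "Chapter 3: The Harbour"),
]
for a, b in pairs:
    print(classify_pair(a, b), "|", a, "||", b)
```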

Exploiting a Large Strongly Comparable Corpus

Thierry Etchegoyhen, Andoni Azpeitia and Naiara Pérez

This article describes a large comparable corpus for Basque

and Spanish and the methods employed to build a parallel

resource from the original data. The EITB corpus, a strongly

comparable corpus in the news domain, is to be shared with the

research community, as an aid for the development and testing

of methods in comparable corpora exploitation, and as a basis for

the improvement of data-driven machine translation systems for

this language pair. Competing approaches were explored for the

alignment of comparable segments in the corpus, resulting in

the design of a simple method which outperformed a state-of-

the-art method on the corpus test sets. The method we present

is highly portable, computationally efficient, and significantly

reduces deployment work, a welcome result for the exploitation

of comparable corpora.

The United Nations Parallel Corpus v1.0

Michał Ziemski, Marcin Junczys-Dowmunt and Bruno Pouliquen

This paper describes the creation process and statistics of the

official United Nations Parallel Corpus, the first parallel corpus

composed from United Nations documents published by the

original data creator. The parallel corpus presented consists of

manually translated UN documents from the last 25 years (1990 to

2014) for the six official UN languages, Arabic, Chinese, English,

French, Russian, and Spanish. The corpus is freely available

for download under a liberal license. Apart from the pairwise

aligned documents, a fully aligned subcorpus for the six official

UN languages is distributed. We provide baseline BLEU scores

of our Moses-based SMT systems trained with the full data of

language pairs involving English and for all possible translation

directions of the six-way subcorpus.

WAGS: A Beautiful English-Italian Benchmark Supporting Word Alignment Evaluation on Rare Words

Luisa Bentivogli, Mauro Cettolo, M. Amin Farajian and Marcello Federico

This paper presents WAGS (Word Alignment Gold Standard),

a novel benchmark which allows extensive evaluation of WA

tools on out-of-vocabulary (OOV) and rare words. WAGS is

a subset of the Common Test section of the Europarl English-

Italian parallel corpus, and is specifically tailored to OOV and rare

words. WAGS is composed of 6,715 sentence pairs containing

11,958 occurrences of OOV and rare words up to frequency 15 in

the Europarl Training set (5,080 English words and 6,878 Italian

words), representing almost 3% of the whole text. Since WAGS

is focused on OOV/rare words, manual alignments are provided

for these words only, and not for the whole sentences. Two off-

the-shelf word aligners have been evaluated on WAGS, and results

have been compared to those obtained on an existing benchmark

tailored to full text alignment. The results obtained confirm that

WAGS is a valuable resource, which allows a statistically sound

evaluation of WA systems’ performance on OOV and rare words,

as well as extensive data analyses. WAGS is publicly released

under a Creative Commons Attribution license.

Manual and Automatic Paraphrases for MT Evaluation

Aleš Tamchyna and Petra Barancikova

Paraphrasing of reference translations has been shown to improve

the correlation with human judgements in automatic evaluation

of machine translation (MT) outputs. In this work, we present

a new dataset for evaluating English-Czech translation based

on automatic paraphrases. We compare this dataset with an

existing set of manually created paraphrases and find that even

automatic paraphrases can improve MT evaluation. We

also propose and evaluate several criteria for selecting suitable

reference translations from a larger set.

O40 - Treebanks and Syntactic and Semantic Analysis

Friday, May 27, 9:45

Chairperson: Joakim Nivre (Oral Session)

Poly-GrETEL: Cross-Lingual Example-based Querying of Syntactic Constructions

Liesbeth Augustinus, Vincent Vandeghinste and Tom Vanallemeersch

We present Poly-GrETEL, an online tool which enables syntactic

querying in parallel treebanks, based on the monolingual GrETEL

environment. We provide online access to the Europarl parallel

treebank for Dutch and English, allowing users to query the

treebank using either an XPath expression or an example sentence

in order to look for similar constructions. We provide automatic

alignments between the nodes. By combining example-based

query functionality with node alignments, we limit the need for

users to be familiar with the query language and the structure of

the trees in the source and target language, thus facilitating the

use of parallel corpora for comparative linguistics and translation

studies.
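
To give a flavour of XPath-based treebank querying (the XML below is a simplified, invented fragment, not the actual GrETEL or Europarl treebank format):

```python
# Tiny illustration of XPath-based treebank querying; the XML fragment below is
# invented and simplified, not the Poly-GrETEL / Europarl treebank format.
from lxml import etree

root = etree.fromstring("""
<sentence>
  <node cat="smain">
    <node rel="su" pos="noun" word="commissie"/>
    <node rel="hd" pos="verb" word="keurt"/>
    <node rel="obj1" pos="noun" word="voorstel"/>
  </node>
</sentence>
""".strip())

# Find head verbs that have a nominal subject as a sibling node.
hits = root.xpath('//node[@rel="hd" and @pos="verb"][../node[@rel="su" and @pos="noun"]]')
for node in hits:
    print(node.get("word"))  # prints: keurt
```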

NorGramBank: A ‘Deep’ Treebank for Norwegian

Helge Dyvik, Paul Meurer, Victoria Rosén, Koenraad De Smedt, Petter Haugereid, Gyri Smørdal Losnegaard, Gunn Inger Lyse and Martha Thunes

We present NorGramBank, a treebank for Norwegian with highly

detailed LFG analyses. It is one of many treebanks made available

through the INESS treebanking infrastructure. NorGramBank

was constructed as a parsebank, i.e. by automatically parsing a

corpus, using the wide coverage grammar NorGram. One part

consisting of 350,000 words has been manually disambiguated

using computer-generated discriminants. A larger part of 50

M words has been stochastically disambiguated. The treebank

is dynamic: by global reparsing at certain intervals it is kept

compatible with the latest versions of the grammar and the

lexicon, which are continually further developed in interaction

with the annotators. A powerful query language, INESS Search,

has been developed for search across formalisms in the INESS

treebanks, including LFG c- and f-structures. Evaluation shows

that the grammar provides about 85% of randomly selected

sentences with good analyses. Agreement among the annotators

responsible for manual disambiguation is satisfactory, but also

suggests desirable simplifications of the grammar.

Accurate Deep Syntactic Parsing of Graphs: The Case of French

Corentin Ribeyre, Eric Villemonte de la Clergerie and Djamé Seddah

Parsing predicate-argument structures in a deep syntax framework

requires graphs to be predicted. Argument structures represent

a higher level of abstraction than the syntactic ones and are

thus more difficult to predict even for highly accurate parsing

models on surfacic syntax. In this paper we investigate deep

syntax parsing, using a French data set (Ribeyre et al., 2014a).

We demonstrate that the use of topologically different types of

syntactic features, such as dependencies, tree fragments, spines

or syntactic paths, brings a much needed context to the parser.

Our higher-order parsing model, gaining thus up to 4 points,

establishes the state of the art for parsing French deep syntactic

structures.

Explicit Fine-grained Syntactic and Semantic Annotation of the Idafa Construction in Arabic

Abdelati Hawwari, Mohammed Attia, Mahmoud Ghoneim and Mona Diab

Idafa in traditional Arabic grammar is an umbrella construction

that covers several phenomena including what is expressed

in English as noun-noun compounds and Saxon and Norman

genitives. Additionally, Idafa participates in some other

constructions, such as quantifiers, quasi-prepositions, and

adjectives. Identifying the various types of the Idafa construction

(IC) is of importance to Natural Language Processing (NLP)

applications. Noun-Noun compounds exhibit special behavior in

most languages impacting their semantic interpretation. Hence

distinguishing them could have an impact on downstream NLP

applications. The most comprehensive syntactic representation

of the Arabic language is the LDC Arabic Treebank (ATB). In

the ATB, ICs are not explicitly labeled and furthermore, there

is no distinction between ICs of noun-noun relations and other

traditional ICs. Hence, we devise a detailed syntactic and

semantic typification process of the IC phenomenon in Arabic.

We target the ATB as a platform for this classification. We render

the ATB annotated with explicit IC labels but with the further

semantic characterization which is useful for syntactic, semantic

and cross language processing. Our typification of IC comprises

3 main syntactic IC types: FIC, GIC, and TIC, and they are

further divided into 10 syntactic subclasses. The TIC group is

further classified into semantic relations. We devise a method for

automatic IC labeling and compare its yield against the CATiB

treebank. Our evaluation shows that we achieve the same level

of accuracy, but with the additional fine-grained classification into

the various syntactic and semantic types.

P44 - Corpus Creation and Querying (1)

Friday, May 27, 9:45

Chairperson: Cristina Bosco (Poster Session)

Compasses, Magnets, Water Microscopes: Annotation of Terminology in a Diachronic Corpus of Scientific Texts

Anne-Kathrin Schumann and Stefan Fischer

The specialised lexicon belongs to the most prominent attributes

of specialised writing: Terms function as semantically dense

encodings of specialised concepts, which, in the absence of terms,

would require lengthy explanations and descriptions. In this paper,

we argue that terms are the result of diachronic processes on

both the semantic and the morpho-syntactic level. Very little

is known about these processes. We therefore present a corpus

annotation project aiming at revealing how terms are coined

and how they evolve to fit their function as semantically and

morpho-syntactically dense encodings of specialised knowledge.

The scope of this paper is two-fold: Firstly, we outline our

methodology for annotating terminology in a diachronic corpus of

scientific publications. Moreover, we provide a detailed analysis

of our annotation results and suggest methods for improving the

accuracy of annotations in a setting as difficult as ours. Secondly,

we present results of a pilot study based on the annotated terms.

The results suggest that terms in older texts are linguistically

relatively simple units that are hard to distinguish from the lexicon

of general language. We believe that this supports our hypothesis

that terminology undergoes diachronic processes of densification

and specialisation.

KorAP Architecture – Diving in the Deep Sea of Corpus Data

Nils Diewald, Michael Hanl, Eliza Margaretha, Joachim Bingel, Marc Kupietz, Piotr Banski and Andreas Witt

KorAP is a corpus search and analysis platform, developed at the

Institute for the German Language (IDS). It supports very large

corpora with multiple annotation layers, multiple query languages,

and complex licensing scenarios. KorAP’s design aims to be

scalable, flexible, and sustainable to serve the German Reference

Corpus DeReKo for at least the next decade. To meet these

requirements, we have adopted a highly modular microservice-

based architecture. This paper outlines our approach: An

architecture consisting of small components that are easy to

extend, replace, and maintain. The components include a search

backend, a user and corpus license management system, and a

web-based user frontend. We also describe a general corpus query

protocol used by all microservices for internal communications.

KorAP is open source, licensed under BSD-2, and available on

GitHub.

Text Segmentation of Digitized Clinical Texts

Cyril Grouin

In this paper, we present the experiments we made to recover

the original page layout structure into two columns from layout

damaged digitized files. We designed several CRF-based

approaches, either to identify column separator or to classify each

token from each line into left or right columns. We achieved our

best results with a model trained on homogeneous corpora (only

files composed of 2 columns) when classifying each token into left

or right columns (overall F-measure of 0.968). Our experiments

show it is possible to recover the original layout in columns of

digitized documents with results of good quality.

A Turkish Database for Psycholinguistic Studies Based on Frequency, Age of Acquisition, and Imageability

Elif Ahsen Acar, Deniz Zeyrek, Murathan Kurfalı and Cem Bozsahin

This study primarily aims to build a Turkish psycholinguistic

database including three variables: word frequency, age of

acquisition (AoA), and imageability, where AoA and imageability

information are limited to nouns. We used a corpus-based

approach to obtain information about the AoA variable. We built

two corpora: a child literature corpus (CLC) including 535 books

written for 3-12 years old children, and a corpus of transcribed

children’s speech (CSC) at ages 1;4-4;8. A comparison between

the word frequencies of CLC and CSC gave positive correlation

results, suggesting the usability of the CLC to extract AoA

information. We assumed that frequent words of the CLC would

correspond to early acquired words whereas frequent words of a

corpus of adult language would correspond to late acquired words.

To validate AoA results from our corpus-based approach, a rated

AoA questionnaire was conducted on adults. Imageability values

were collected via a different questionnaire conducted on adults.

We conclude that it is possible to deduce AoA information for

high frequency words with the corpus-based approach. The results

about low frequency words were inconclusive, which is attributed

to the fact that corpus-based AoA information is affected by the

strong negative correlation between corpus frequency and rated

AoA.

Domain-Specific Corpus Expansion with Focused Webcrawling

Steffen Remus and Chris Biemann

This work presents a straightforward method for extending or

creating in-domain web corpora by focused webcrawling. The

focused webcrawler uses statistical N-gram language models to

estimate the relatedness of documents and weblinks and needs

as input only N-grams or plain texts of a predefined domain

and seed URLs as starting points. Two experiments demonstrate

that our focused crawler is able to stay focused in domain

and language. The first experiment shows that the crawler

stays in a focused domain, the second experiment demonstrates

that language models trained on focused crawls obtain better

perplexity scores on in-domain corpora. We distribute the focused

crawler as open source software.
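
The core relatedness test can be sketched with a unigram language model standing in for the N-gram models mentioned above (the seed text, pages and threshold are invented; this is not the released crawler):

```python
# Sketch of the relatedness scoring behind a focused crawler: candidate pages
# are scored with a language model trained on in-domain seed text and followed
# only above a threshold. A unigram model with add-one smoothing stands in for
# the N-gram models of the paper; seed text, pages and threshold are invented.
import math
from collections import Counter

def train_unigram(in_domain_text):
    counts = Counter(in_domain_text.lower().split())
    total, vocab = sum(counts.values()), len(counts)
    def logprob(token):
        return math.log((counts.get(token, 0) + 1) / (total + vocab))
    return logprob

def relatedness(page_text, logprob):
    tokens = page_text.lower().split()
    # Average token log-probability: higher means closer to the seed domain.
    return sum(logprob(t) for t in tokens) / max(len(tokens), 1)

lm = train_unigram("language resources and evaluation corpora annotation corpus data")
for page in ["new annotation guidelines for corpus data",
             "cheap flights and hotel deals this summer"]:
    score = relatedness(page, lm)
    print(f"{score:.2f}", "follow" if score > -2.5 else "skip", "|", page)
```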

Corpus-Based Diacritic Restoration for South Slavic Languages

Nikola Ljubešić, Tomaž Erjavec and Darja Fišer

In computer-mediated communication, Latin-based scripts users

often omit diacritics when writing. Such text is typically easily

understandable to humans but very difficult for computational

processing because many words become ambiguous or unknown.

Letter-level approaches to diacritic restoration generalise better

and do not require a lot of training data but word-level approaches

tend to yield better results. However, they typically rely on a

lexicon which is an expensive resource, not covering non-standard

forms, and often not available for less-resourced languages. In

this paper we present diacritic restoration models that are trained

on easy-to-acquire corpora. We test three different types of

corpora (Wikipedia, general web, Twitter) for three South Slavic

languages (Croatian, Serbian and Slovene) and evaluate them

on two types of text: standard (Wikipedia) and non-standard

(Twitter). The proposed approach considerably outperforms

charlifter, so far the only open source tool available for this task.

We make the best performing systems freely available.
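
A minimal word-level sketch of corpus-based diacritic restoration, assuming a lookup table of most frequent diacritized variants (a simplification; the paper also studies letter-level models, and the toy corpus below is invented):

```python
# Word-level sketch of corpus-based diacritic restoration: map each stripped
# form to its most frequent diacritized variant seen in a corpus, fall back to
# the input form otherwise. The tiny "corpus" below is invented.
import unicodedata
from collections import Counter, defaultdict

def strip_diacritics(word):
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def train(corpus_tokens):
    table = defaultdict(Counter)
    for token in corpus_tokens:
        table[strip_diacritics(token)][token] += 1
    # Keep the most frequent diacritized variant for each stripped form.
    return {plain: variants.most_common(1)[0][0] for plain, variants in table.items()}

def restore(text, table):
    return " ".join(table.get(tok, tok) for tok in text.split())

corpus = "že čakam že vem kaj že počne".split()  # toy Slovene-like data
table = train(corpus)
print(restore("ze vem kaj ze pocne", table))     # prints: že vem kaj že počne
```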

Automatic Recognition of Linguistic Replacements in Text Series Generated from Keystroke Logs

Daniel Couto-Vale, Stella Neumann and Paula Niemietz

This paper introduces a toolkit used for the purpose of

detecting replacements of different grammatical and semantic

structures in ongoing text production logged as a chronological

series of computer interaction events (so-called keystroke logs).

The specific case we use involves human translations where

replacements can be indicative of translator behaviour that leads

to specific features of translations that distinguish them from

non-translated texts. The toolkit uses a novel CCG chart parser

customised so as to recognise grammatical words independently

of space and punctuation boundaries. On the basis of the

linguistic analysis, structures in different versions of the target

text are compared and classified as potential equivalents of the

same source text segment by ‘equivalence judges’. In that way,

replacements of grammatical and semantic structures can be

detected. Beyond the specific task at hand the approach will also

be useful for the analysis of other types of spaceless text such as

Twitter hashtags and texts in agglutinative or spaceless languages

like Finnish or Chinese.

Automatic Corpus Extension for Data-driven Natural Language Generation

Elena Manishina, Bassam Jabaian, Stéphane Huet and Fabrice Lefevre

As data-driven approaches started to make their way into the

Natural Language Generation (NLG) domain, the need for

automation of corpus building and extension became apparent.

Corpus creation and extension in data-driven NLG domain

traditionally involved manual paraphrasing performed by either

a group of experts or with resort to crowd-sourcing. Building the

training corpora manually is a costly enterprise which requires a

lot of time and human resources. We propose to automate the

process of corpus extension by integrating automatically obtained

synonyms and paraphrases. Our methodology allowed us to

significantly increase the size of the training corpus and its level

of variability (the number of distinct tokens and specific syntactic

structures). Our extension solutions are fully automatic and

require only some initial validation. The human evaluation results

confirm that in many cases native speakers favor the outputs of the

model built on the extended corpus.

Bilbo-Val: Automatic Identification of Bibliographical Zone in Papers

Amal Htait, Sebastien Fournier and Patrice Bellot

In this paper, we present the automatic annotation of

bibliographical references’ zone in papers and articles of

XML/TEI format. Our work is applied through two phases: first,

we use machine learning technology to classify bibliographical

and non-bibliographical paragraphs in papers, by means of a

model that was initially created to differentiate between the

footnotes containing or not containing bibliographical references.

The previous description is one of BILBO’s features, which is

open source software for automatic annotation of bibliographic

reference. Also, we suggest some methods to minimize the margin

of error. Second, we propose an algorithm to find the largest list of

bibliographical references in the article. The improvement applied

to our model results in an increase in the model’s efficiency, with an

accuracy equal to 85.89. By testing our work, we are able

to achieve 72.23% as an average for the percentage of success in

detecting bibliographical references’ zone.

Large Scale Arabic Diacritized Corpus: Guidelines and Framework

Wajdi Zaghouani, Houda Bouamor, Abdelati Hawwari, Mona Diab, Ossama Obeid, Mahmoud Ghoneim, Sawsan Alqahtani and Kemal Oflazer

This paper presents the annotation guidelines developed as part

of an effort to create a large scale manually diacritized corpus

for various Arabic text genres. The target size of the annotated

corpus is 2 million words. We summarize the guidelines and

describe issues encountered during the training of the annotators.

We also discuss the challenges posed by the complexity of the

Arabic language and how they are addressed. Finally, we present

the diacritization annotation procedure and detail the quality of the

resulting annotations.

P45 - Evaluation Methodologies (3)

Friday, May 27, 9:45

Chairperson: Marta Villegas (Poster Session)

Applying the Cognitive Machine Translation Evaluation Approach to Arabic

Irina Temnikova, Wajdi Zaghouani, Stephan Vogel and Nizar Habash

The goal of the cognitive machine translation (MT) evaluation

approach is to build classifiers which assign post-editing effort

scores to new texts. The approach helps estimate fair

compensation for post-editors in the translation industry by

evaluating the cognitive difficulty of post-editing MT output.

The approach counts the number of errors classified in different

categories on the basis of how much cognitive effort they require

in order to be corrected. In this paper, we present the results

of applying an existing cognitive evaluation approach to Modern

Standard Arabic (MSA). We provide a comparison of the number

of errors and categories of errors in three MSA texts of different

MT quality (without any language-specific adaptation), as well

as a comparison between MSA texts and texts from three Indo-

European languages (Russian, Spanish, and Bulgarian), taken

from a previous experiment. The results show how the error

distributions change passing from the MSA texts of worse MT

quality to MSA texts of better MT quality, as well as a similarity

in distinguishing the texts of better MT quality for all four

languages.

A Reading Comprehension Corpus for Machine Translation Evaluation

Carolina Scarton and Lucia Specia

Effectively assessing Natural Language Processing output tasks

is a challenge for research in the area. In the case of Machine

Translation (MT), automatic metrics are usually preferred over

human evaluation, given time and budget constraints. However,

traditional automatic metrics (such as BLEU) are not reliable for

absolute quality assessment of documents, often producing similar

scores for documents translated by the same MT system. For

scenarios where absolute labels are necessary for building models,

such as document-level Quality Estimation, these metrics cannot

be fully trusted. In this paper, we introduce a corpus of reading

comprehension tests based on machine translated documents,

where we evaluate documents based on answers to questions by

fluent speakers of the target language. We describe the process

of creating such a resource, the experiment design and agreement

between the test takers. Finally, we discuss ways to convert the

reading comprehension test into document-level quality scores.

B2SG: a TOEFL-like Task for Portuguese

Rodrigo Wilkens, Leonardo Zilio, Eduardo Ferreira and Aline Villavicencio

Resources such as WordNet are useful for NLP applications,

but their manual construction consumes time and personnel,

and frequently results in low coverage. One alternative is

the automatic construction of large resources from corpora like

distributional thesauri, containing semantically associated words.

However, as they may contain noise, there is a strong need

for automatic ways of evaluating the quality of the resulting

resource. This paper introduces a gold standard that can aid in this

task. The BabelNet-Based Semantic Gold Standard (B2SG) was

automatically constructed based on BabelNet and partly evaluated

by human judges. It consists of sets of tests that present one target

word, one related word and three unrelated words. B2SG contains

2,875 validated relations: 800 for verbs and 2,075 for nouns; these

relations are divided among synonymy, antonymy and hypernymy.

They can be used as the basis for evaluating the accuracy of the

similarity relations on distributional thesauri by comparing the

proximity of the target word with the related and unrelated options

and observing if the related word has the highest similarity value

among them. As a case study two distributional thesauri were

also developed: one using surface forms from a large (1.5 billion

word) corpus and the other using lemmatized forms from a smaller

(409 million word) corpus. Both distributional thesauri were then

evaluated against B2SG, and the one using lemmatized forms

performed slightly better.
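
As an aside for readers implementing the evaluation described above, the test reduces to checking whether the related word is the nearest candidate to the target. A minimal sketch, assuming precomputed word vectors in a Python dict; the item format and function names are illustrative, not the released B2SG format.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def b2sg_accuracy(test_items, vectors):
    """test_items: list of (target, related, [distractor1, distractor2, distractor3]).
    vectors: dict mapping words to numpy arrays (e.g., from a distributional thesaurus).
    An item counts as correct when the related word is the candidate closest to the target."""
    correct = 0
    for target, related, distractors in test_items:
        candidates = [related] + distractors
        sims = [cosine(vectors[target], vectors[c]) for c in candidates]
        if int(np.argmax(sims)) == 0:  # index 0 is the related word
            correct += 1
    return correct / len(test_items)
```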

MoBiL: A Hybrid Feature Set for Automatic Human Translation Quality Assessment

Yu Yuan, Serge Sharoff and Bogdan Babych

In this paper we introduce MoBiL, a hybrid Monolingual,

Bilingual and Language modelling feature set and feature

selection and evaluation framework. The set includes translation

quality indicators that can be utilized to automatically predict

the quality of human translations in terms of content adequacy

and language fluency. We compare MoBiL with the QuEst

baseline set by using them in classifiers trained with support

vector machine and relevance vector machine learning algorithms

on the same data set. We also report an experiment on feature

selection to opt for fewer but more informative features from

MoBiL. Our experiments show that classifiers trained on our

feature set perform consistently better in predicting both adequacy

and fluency than the classifiers trained on the baseline feature set.

MoBiL also performs well when used with both support vector

machine and relevance vector machine algorithms.

MARMOT: A Toolkit for Translation Quality Estimation at the Word Level

Varvara Logacheva, Chris Hokamp and Lucia Specia

We present Marmot – a new toolkit for quality estimation (QE)

of machine translation output. Marmot contains utilities targeted

at quality estimation at the word and phrase level. However,

due to its flexibility and modularity, it can also be extended to

work at the sentence level. In addition, it can be used as a

framework for extracting features and learning models for many

common natural language processing tasks. The tool has a set

of state-of-the-art features for QE, and new features can easily

be added. The tool is open-source and can be downloaded from

https://github.com/qe-team/marmot/

RankDCG: Rank-Ordering Evaluation Measure

Denys Katerenchuk and Andrew Rosenberg

Ranking is used for a wide array of problems, most notably

information retrieval (search). Kendall’s τ, Average Precision,

and nDCG are a few popular approaches to the evaluation of

ranking. When dealing with problems such as user ranking or

recommendation systems, all these measures suffer from various

problems, including the inability to deal with elements of the

same rank, inconsistent and ambiguous lower bound scores, and

an inappropriate cost function. We propose a new measure, a

modification of the popular nDCG algorithm, named rankDCG,

that addresses these problems. We provide a number of criteria

for any effective ranking algorithm and show that only rankDCG

satisfies them all. Results are presented on constructed and real

data sets. We release a publicly available rankDCG evaluation

package.
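
For reference, a minimal implementation of the standard nDCG measure that rankDCG modifies; the abstract does not spell out the rankDCG modification itself, so only the baseline formulation is sketched here.

```python
import math

def dcg(relevances, k=None):
    """Discounted cumulative gain of a ranked list of graded relevance scores,
    using the common formulation sum_i (2^rel_i - 1) / log2(i + 1)."""
    rels = relevances[:k] if k else relevances
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(rels))

def ndcg(relevances, k=None):
    """nDCG = DCG of the system ranking divided by DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Toy example: ndcg([3, 2, 3, 0, 1, 2]) is roughly 0.95.
```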

Spanish Word Vectors from Wikipedia

Mathias Etcheverry and Dina Wonsever

Content analysis of text data requires semantic representations

that are difficult to obtain automatically, as they may require large

handcrafted knowledge bases or manually annotated examples.

Unsupervised autonomous methods for generating semantic

representations are of greatest interest in the face of huge volumes

of text to be exploited in all kinds of applications. In this work we

describe the generation and validation of semantic representations

in the vector space paradigm for Spanish. The method used is

GloVe (Pennington, 2014), one of the best-performing reported

methods, and vectors were trained over Spanish Wikipedia. The

evaluation of the learned vectors is done in terms of word analogy

and similarity tasks (Pennington, 2014; Baroni, 2014; Mikolov,

2013a). The vector set and a Spanish version for some widely

used semantic relatedness tests are made publicly available.
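
A minimal sketch of the word analogy test used in such evaluations (the 3CosAdd rule of Mikolov et al., 2013), assuming a dict of L2-normalised word vectors; the Spanish example in the comment is illustrative.

```python
import numpy as np

def analogy(a, b, c, vectors, topn=1):
    """Solve 'a is to b as c is to ?' with the 3CosAdd rule: argmax cos(x, b - a + c).
    vectors: dict mapping words to L2-normalised numpy arrays.
    The three query words are excluded from the candidate answers."""
    query = vectors[b] - vectors[a] + vectors[c]
    query /= np.linalg.norm(query)
    scores = {w: float(np.dot(v, query)) for w, v in vectors.items() if w not in (a, b, c)}
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# e.g. analogy("hombre", "rey", "mujer", vectors) is expected to return ["reina"]
```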

P46 - Information Extraction and Retrieval (3)

Friday, May 27, 9:45

Chairperson: Aurelie Neveol Poster Session

Analyzing Pre-processing Settings for Urdu Single-document Extractive Summarization

Muhammad Humayoun and Hwanjo Yu

Preprocessing is a preliminary step in many fields including IR

and NLP. The effect of basic preprocessing settings on English

for text summarization is well-studied. However, there is no

such effort found for the Urdu language (to the best of our

knowledge). In this study, we analyze the effect of basic

preprocessing settings for single-document text summarization

for Urdu, on a benchmark corpus using various experiments.

The analysis is performed using the state-of-the-art algorithms

for extractive summarization and the effect of stopword removal,

lemmatization, and stemming is analyzed. Results showed that

these pre-processing settings improve the results.

Semantic Annotation of the ACL Anthology Corpus for the Automatic Analysis of Scientific Literature

Kata Gábor, Haifa Zargayouna, Davide Buscaldi, Isabelle Tellier and Thierry Charnois

This paper describes the process of creating a corpus annotated for

concepts and semantic relations in the scientific domain. A part

of the ACL Anthology Corpus was selected for annotation, but

the annotation process itself is not specific to the computational

linguistics domain and could be applied to any scientific corpora.

Concepts were identified and annotated fully automatically,

based on a combination of terminology extraction and available

ontological resources. A typology of semantic relations between

concepts is also proposed. This typology, consisting of 18

domain-specific and 3 generic relations, is the result of a corpus-

based investigation of the text sequences occurring between

concepts in sentences. A sample of 500 abstracts from the

corpus is currently being manually annotated with these semantic

relations. Only explicit relations are taken into account, so that

the data could serve to train or evaluate pattern-based semantic

relation classification systems.

GATE-Time: Extraction of Temporal Expressions and Events

Leon Derczynski, Jannik Strötgen, Diana Maynard, Mark A. Greenwood and Manuel Jung

GATE is a widely used open-source solution for text processing

with a large user community. It contains components for

several natural language processing tasks. However, temporal

information extraction functionality within GATE has been rather

limited so far, despite being a prerequisite for many application

scenarios in the areas of natural language processing and

information retrieval. This paper presents an integrated approach

to temporal information processing. We take state-of-the-art

tools in temporal expression and event recognition and bring

them together to form an openly-available resource within the

GATE infrastructure. GATE-Time provides annotation in the

form of TimeML events and temporal expressions complying with

this mature ISO standard for temporal semantic annotation of

documents. Major advantages of GATE-Time are (i) that it relies

on HeidelTime for temporal tagging, so that temporal expressions

can be extracted and normalized in multiple languages and across

different domains, (ii) it includes a modern, fast event recognition

and classification tool, and (iii) that it can be combined with

different linguistic pre-processing annotations, and is thus not

bound to license restricted preprocessing components.

Distributional Thesauri for Information Retrieval and vice versa

Vincent Claveau and Ewa Kijak

Distributional thesauri are useful in many tasks of Natural

Language Processing. In this paper, we address the problem of

building and evaluating such thesauri with the help of Information

Retrieval (IR) concepts. Two main contributions are proposed.

First, following the work of [8], we show how IR tools and

concepts can be used with success to build a thesaurus. Through

several experiments and by evaluating directly the results with

reference lexicons, we show that some IR models outperform

state-of-the-art systems. Secondly, we use IR as an applicative

framework to indirectly evaluate the generated thesaurus. Here

again, this task-based evaluation validates the IR approach used

to build the thesaurus. Moreover, it allows us to compare these

results with those from the direct evaluation framework used in the

literature. The observed differences bring these evaluation habits

into question.

Parallel Chinese-English Entities, Relations andEvents Corpora

Justin Mott, Ann Bies, Zhiyi Song and Stephanie Strassel

This paper introduces the parallel Chinese-English Entities,

Relations and Events (ERE) corpora developed by Linguistic Data

Consortium under the DARPA Deep Exploration and Filtering of

Text (DEFT) Program. Original Chinese newswire and discussion

forum documents are annotated for two versions of the ERE task.

The texts are manually translated into English and then annotated

for the same ERE tasks on the English translation, resulting in

a rich parallel resource that has utility for performers within

the DEFT program, for participants in NIST’s Knowledge Base

Population evaluations, and for cross-language projection research

more generally.

The PsyMine Corpus - A Corpus annotated with Psychiatric Disorders and their Etiological Factors

Tilia Ellendorff, Simon Foster and Fabio Rinaldi

We present the first version of a corpus annotated for psychiatric

disorders and their etiological factors. The paper describes the

choice of text, annotated entities and events/relations as well

as the annotation scheme and procedure applied. The corpus

features a selection of focus psychiatric disorders, including

depressive disorder, anxiety disorder, obsessive-compulsive

disorder, phobic disorders and panic disorder. Etiological factors

for these focus disorders are widespread and include genetic,

physiological, sociological and environmental factors among

others. Etiological events, including annotated evidence text,

represent the interactions between the focus disorders and their

etiological factors. In addition to these core events, symptomatic

and treatment events have been annotated. The current version

of the corpus includes 175 scientific abstracts. All entities

and events/relations have been manually annotated by domain

experts and scores of inter-annotator agreement are presented.

The aim of the corpus is to provide a first gold standard to

support the development of biomedical text mining applications

for the specific area of mental disorders which belong to the main

contributors to the contemporary burden of disease.

An Empirical Exploration of Moral Foundations Theory in Partisan News Sources

Dean Fulgoni, Jordan Carpenter, Lyle Ungar and Daniel Preotiuc-Pietro

News sources frame issues in different ways in order to appeal to

or control the perception of their readers. We present a large

scale study of news articles from partisan sources in the US across

a variety of different issues. We first highlight that differences

between sides exist by predicting the political leaning of articles

of unseen political bias. Framing can be driven by different types

of morality that each group values. We emphasize differences

in the framing of different news sources, building on moral foundations

theory quantified using hand-crafted lexicons. Our results show

that partisan sources frame political issues differently both in

terms of word usage and through the moral foundations they

relate to.

Building a Dataset for Possessions Identification in Text

Carmen Banea, Xi Chen and Rada Mihalcea

Just as industrialization matured from mass production to

customization and personalization, so has the Web migrated from

generic content to public disclosures of one’s most intimately held

thoughts, opinions and beliefs. This relatively new type of data is

able to represent finer and more narrowly defined demographic

slices. If until now researchers have primarily focused on

leveraging personalized content to identify latent information such

as gender, nationality, location, or age of the author, this study

seeks to establish a structured way of extracting possessions, or

items that people own or are entitled to, as a way to ultimately

provide insights into people’s behaviors and characteristics. In

order to promote more research in this area, we are releasing a set

of 798 possessions extracted from the blog genre, where possessions

are marked at different confidence levels, as well as a detailed set

of guidelines to help in future annotation studies.

The Query of Everything: Developing Open-Domain, Natural-Language Queries for BOLT Information Retrieval

Kira Griffitt and Stephanie Strassel

The DARPA BOLT Information Retrieval evaluations target open-

domain natural-language queries over a large corpus of informal

text in English, Chinese and Egyptian Arabic. We outline the

goals of BOLT IR, comparing it with the prior GALE Distillation

task. After discussing the properties of the BOLT IR corpus,

we provide a detailed description of the query creation process,

contrasting the summary query format presented to systems at run

time with the full query format created by annotators. We describe

the relevance criteria used to assess BOLT system responses,

highlighting the evolution of the procedures used over the three

evaluation phases. We provide a detailed review of the decision

points model for relevance assessment introduced during Phase

2, and conclude with information about inter-assessor consistency

achieved with the decision points assessment model.

The Validation of MRCPD Cross-language Expansions on Imageability Ratings

Ting Liu, Kit Cho, Tomek Strzalkowski, Samira Shaikh and Mehrdad Mirzaei

In this article, we present a method to validate a multi-lingual

(English, Spanish, Russian, and Farsi) corpus on imageability

ratings automatically expanded from MRCPD (Liu et al., 2014).

We employed the corpus (Brysbaert et al., 2014) on concreteness

ratings for our English MRCPD+ validation because of the lack of

human-assessed imageability ratings and the high correlation between

concreteness ratings and imageability ratings (e.g. r = .83).

For the same reason, we built a small corpus with human

imageability assessment for the other language corpus validation.

The results show that the automatically expanded imageability

ratings are highly correlated with human assessment in all four

languages, which demonstrate our automatic expansion method

is valid and robust. We believe these new resources can be

of significant interest to the research community, particularly in

natural language processing and computational sociolinguistics.
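
The validation amounts to correlating the automatically expanded ratings with human judgments; a minimal sketch with made-up numbers, not the actual MRCPD+ data.

```python
from scipy.stats import pearsonr

# Hypothetical aligned rating lists for words present in both resources.
expanded_ratings = [5.2, 3.1, 6.4, 2.0, 4.7]   # automatically expanded imageability scores
human_ratings    = [5.0, 3.4, 6.1, 2.3, 4.9]   # human-assessed (or concreteness) scores

r, p_value = pearsonr(expanded_ratings, human_ratings)
print(f"Pearson r = {r:.2f} (p = {p_value:.3g})")
```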

Building Tempo-HindiWordNet: A resource for effective temporal information access in Hindi

Dipawesh Pawar, Mohammed Hasanuzzaman and Asif Ekbal

In this paper, we put forward a strategy that supplements Hindi

WordNet entries with information on the temporality of its word

senses. Each synset of Hindi WordNet is automatically annotated

with one of five dimensions: past, present, future, neutral and

atemporal. We use a semi-supervised learning strategy to build

temporal classifiers over the glosses of manually selected initial

seed synsets. The classification process is iterated based on the

repeated confidence-based expansion of the initial seed

list until cross-validation accuracy drops. The resource is unique

in its nature as, to the best of our knowledge, no such resource

is yet available for Hindi.
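
The confidence-based expansion of the seed list can be read as a generic self-training loop; the sketch below uses an illustrative classifier, feature set and fixed confidence threshold, and simplifies the paper's cross-validation-based stopping criterion.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def self_train(seed_glosses, seed_labels, unlabeled_glosses, threshold=0.9, max_iter=10):
    """Iteratively label WordNet glosses: train on the seed set, add high-confidence
    predictions to it, and retrain. Classifier, features and the fixed confidence
    threshold are illustrative stand-ins for the paper's actual setup."""
    X, y = list(seed_glosses), list(seed_labels)
    pool = list(unlabeled_glosses)
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    for _ in range(max_iter):
        if not pool:
            break
        model.fit(X, y)
        preds = model.predict(pool)
        confidences = model.predict_proba(pool).max(axis=1)
        confident = [i for i, c in enumerate(confidences) if c >= threshold]
        if not confident:
            break
        for i in confident:
            X.append(pool[i])
            y.append(preds[i])
        pool = [g for i, g in enumerate(pool) if i not in set(confident)]
    return model
```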

P47 - Semantic Corpora

Friday, May 27, 9:45

Chairperson: Eneko Agirre Poster Session

Detection of Reformulations in Spoken French

Natalia Grabar and Iris Eshkol-Taravela

Our work addresses automatic detection of enunciations and

segments with reformulations in French spoken corpora. The

proposed approach is syntagmatic. It is based on reformulation

markers and specificities of spoken language. The reference data

are built manually and have gone through consensus. Automatic

methods, based on rules and CRF machine learning, are proposed

in order to detect the enunciations and segments that contain

reformulations. With the CRF models, different features are

exploited within a window of various sizes. Detection of

enunciations with reformulations shows up to 0.66 precision.

The tests performed for the detection of reformulated segments

indicate that the task remains difficult. The best average

performance values reach up to 0.65 F-measure, 0.75 precision,

and 0.63 recall. We envisage several extensions of this work

for improving the detection of reformulated segments and for

studying the data from other points of view.

DT-Neg: Tutorial Dialogues Annotated for Negation Scope and Focus in Context

Rajendra Banjade and Vasile Rus

Negation occurs more frequently in dialogue than in commonly

written texts, such as literary texts. Furthermore, the scope and

focus of negation depend more on context in dialogues than in other

forms of text. Existing negation datasets have focused on non-

dialogue texts such as literary texts where the scope and focus

of negation is normally present within the same sentence where

the negation is located and therefore are not the most appropriate

to inform the development of negation handling algorithms for

dialogue-based systems. In this paper, we present the DT-Neg corpus

(DeepTutor Negation corpus) which contains texts extracted from

tutorial dialogues where students interacted with an Intelligent

Tutoring System (ITS) to solve conceptual physics problems. The

DT-Neg corpus contains annotated negations in student responses

with scope and focus marked based on the context of the dialogue.

Our dataset contains 1,088 instances and is available for research

purposes at http://language.memphis.edu/dt-neg.

Annotating Logical Forms for EHR Questions

Kirk Roberts and Dina Demner-Fushman

This paper discusses the creation of a semantically annotated

corpus of questions about patient data in electronic health records

(EHRs). The goal is to provide the training data necessary for

semantic parsers to automatically convert EHR questions into a

structured query. A layered annotation strategy is used which

mirrors a typical natural language processing (NLP) pipeline.

First, questions are syntactically analyzed to identify multi-

part questions. Second, medical concepts are recognized and

normalized to a clinical ontology. Finally, logical forms are

created using a lambda calculus representation. We use a corpus

of 446 questions asking for patient-specific information. From

these, 468 specific questions are found containing 259 unique

medical concepts and requiring 53 unique predicates to represent

the logical forms. We further present detailed characteristics of the

corpus, including inter-annotator agreement results, and describe

the challenges automatic NLP systems will face on this task.

A Semantically Compositional Annotation Scheme for Time Normalization

Steven Bethard and Jonathan Parker

We present a new annotation scheme for normalizing time

expressions, such as “three days ago”, to computer-readable forms,

such as 2016-03-07. The annotation scheme addresses several

weaknesses of the existing TimeML standard, allowing the

representation of time expressions that align to more than one

calendar unit (e.g., “the past three summers”), that are defined

relative to events (e.g., “three weeks postoperative”), and that

are unions or intersections of smaller time expressions (e.g.,

“Tuesdays and Thursdays”). It achieves this by modeling time

expression interpretation as the semantic composition of temporal

operators like UNION, NEXT, and AFTER. We have applied

the annotation scheme to 34 documents so far, producing 1104

annotations, and achieving inter-annotator agreement of 0.821.
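
The compositional idea can be illustrated with a toy interpreter that applies an operator to calendar units relative to an anchor date; the operator name and anchor below are illustrative, not the scheme's actual inventory.

```python
from datetime import date, timedelta

def LAST(anchor: date, n: int, unit: str) -> date:
    """Toy temporal operator: the point n units before the anchor.
    The name and unit handling are illustrative, not the scheme's formal definitions."""
    units = {"day": 1, "week": 7}
    return anchor - timedelta(days=n * units[unit])

anchor = date(2016, 3, 10)          # assumed document creation time
print(LAST(anchor, 3, "day"))       # -> 2016-03-07, as in "three days ago"
```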

PROMETHEUS: A Corpus of Proverbs Annotated with Metaphors

Gözde Özbal, Carlo Strapparava and Serra Sinem Tekiroglu

Proverbs are commonly metaphoric in nature and the mapping

across domains is commonly established in proverbs. The

abundance of proverbs in terms of metaphors makes them an

extremely valuable linguistic resource since they can be utilized

as a gold standard for various metaphor related linguistic tasks

such as metaphor identification or interpretation. Besides, a

collection of proverbs from various languages annotated with

metaphors would also be essential for social scientists to explore

the cultural differences between those languages. In this paper,

we introduce PROMETHEUS, a dataset consisting of English

proverbs and their equivalents in Italian. In addition to the word-

level metaphor annotations for each proverb, PROMETHEUS

contains other types of information such as the metaphoricity

degree of the overall proverb, its meaning, the century that it was

first recorded in and a pair of subjective questions responded by

the annotators. To the best of our knowledge, this is the first multi-

lingual and open-domain corpus of proverbs annotated with word-

level metaphors.

Corpus Annotation within the French FrameNet: a Domain-by-domain Methodology

Marianne Djemaa, Marie Candito, Philippe Muller and Laure Vieu

This paper reports on the development of a French FrameNet,

within the ASFALDA project. While the first phase of the

project focused on the development of a French set of frames

and corresponding lexicon (Candito et al., 2014), this paper

concentrates on the subsequent corpus annotation phase, which

focused on four notional domains (commercial transactions,

cognitive stances, causality and verbal communication). Given

that full coverage is not reachable for a relatively “new” FrameNet

project, we advocate that focusing on specific notional domains

allowed us to obtain full lexical coverage for the frames of

these domains, while partially reflecting word sense ambiguities.

Furthermore, as frames and roles were annotated on two French

Treebanks (the French Treebank (Abeillé and Barrier, 2004) and

the Sequoia Treebank (Candito and Seddah, 2012)), we were

able to extract a syntactico-semantic lexicon from the annotated

frames. In the resource’s current status, there are 98 frames, 662

frame evoking words, 872 senses, and about 13000 annotated

frames, with their semantic roles assigned to portions of text. The

French FrameNet is freely available at alpage.inria.fr/asfalda.

Covering various Needs in Temporal Annotation: a Proposal of Extension of ISO TimeML that Preserves Upward Compatibility

Anaïs Lefeuvre-Halftermeyer, Jean-Yves Antoine, Alain Couillault, Emmanuel Schang, Lotfi Abouda, Agata Savary, Denis Maurel, Iris Eshkol and Delphine Battistelli

This paper reports a critical analysis of the ISO TimeML standard,

in the light of several experiences of temporal annotation that were

conducted on spoken French. It shows that the norm suffers from

weaknesses that should be corrected to fit a larger variety of needs

in NLP and in corpus linguistics. We present our proposal of

some improvements to the norm before it is revised by the

ISO Committee in 2017. These modifications concern mainly

(1) Enrichments of well identified features of the norm: temporal

function of TIMEX time expressions, additional types for TLINK

temporal relations; (2) Deeper modifications concerning the units

or features annotated: clarification between time and tense for

EVENT units, coherence of representation between temporal

signals (the SIGNAL unit) and TIMEX modifiers (the MOD

feature); (3) A recommendation to perform temporal annotation

on top of a syntactic (rather than lexical) layer (temporal

annotation on a treebank).

A General Framework for the Annotation of Causality Based on FrameNet

Laure Vieu, Philippe Muller, Marie Candito and Marianne Djemaa

We present here a general set of semantic frames to annotate

causal expressions, with a rich lexicon in French and an annotated

corpus of about 5000 instances of causal lexical items with their

corresponding semantic frames. The aim of our project is to

have both the largest possible coverage of causal phenomena in

French, across all parts of speech, and have it linked to a general

semantic framework such as FN, to benefit in particular from

the relations between other semantic frames, e.g., temporal ones

or intentional ones, and the underlying upper lexical ontology

that enable some forms of reasoning. This is part of the larger

ASFALDA French FrameNet project, which focuses on a few

different notional domains which are interesting in their own

right (Djemaa et al., 2016), including cognitive positions and

communication frames. In the process of building the French

lexicon and preparing the annotation of the corpus, we had to

remodel some of the frames proposed in FN based on English

data, with hopefully more precise frame definitions to facilitate

human annotation. This includes semantic clarifications of frames

and frame elements, redundancy elimination, and added coverage.

The result is arguably a significant improvement of the treatment

of causality in FN itself.

Annotating Temporally-Anchored Spatial Knowledge on Top of OntoNotes Semantic Roles

Alakananda Vempala and Eduardo Blanco

This paper presents a two-step methodology to annotate spatial

knowledge on top of OntoNotes semantic roles. First, we

manipulate semantic roles to automatically generate potential

additional spatial knowledge. Second, we crowdsource

annotations with Amazon Mechanical Turk to either validate

or discard the potential additional spatial knowledge. The

resulting annotations indicate whether entities are or are not

located somewhere with a degree of certainty, and temporally

anchor this spatial information. Crowdsourcing experiments

show that the additional spatial knowledge is ubiquitous and

intuitive to humans, and experimental results show that it can

be inferred automatically using standard supervised machine

learning techniques.

SpaceRef: A corpus of street-level geographic descriptions

Jana Götze and Johan Boye

This article describes SPACEREF, a corpus of street-level

geographic descriptions. Pedestrians walk a route in a

(real) urban environment, describing their actions. Their position

is automatically logged, their speech is manually transcribed, and

their references to objects are manually annotated with respect

to a crowdsourced geographic database. We describe how the

data was collected and annotated, and how it has been used

in the context of creating resources for an automatic pedestrian

navigation system.

Persian Proposition Bank

Azadeh Mirzaei and Amirsaeid Moloodi

This paper describes the procedure of semantic role labeling

and the development of the first manually annotated Persian

Proposition Bank (PerPB) which added a layer of predicate-

argument information to the syntactic structures of Persian

Dependency Treebank (known as PerDT). Through the process

of annotating, the annotators could see the syntactic information

of all the sentences and so they annotated 29982 sentences with

more than 9200 unique verbs. In the annotation procedure, the

direct syntactic dependents of the verbs were the first candidates

for being annotated. So we did not annotate the other indirect

dependents unless their phrasal heads were propositional and had

their own arguments or adjuncts. Hence besides the semantic

role labeling of verbs, the argument structure of 1300 unique

propositional nouns and 300 unique propositional adjectives were

annotated in the sentences, too. The accuracy of the annotation

process was measured by double annotation of the data at two

separate stages and finally the data was prepared in the CoNLL

dependency format.

Typed Entity and Relation Annotation on Computer Science Papers

Yuka Tateisi, Tomoko Ohta, Sampo Pyysalo, Yusuke Miyao and Akiko Aizawa

We describe our ongoing effort to establish an annotation scheme

for describing the semantic structures of research articles in the

computer science domain, with the intended use of developing

search systems that can refine their results by the roles of the

entities denoted by the query keys. In our scheme, mentions of

entities are annotated with ontology-based types, and the roles of

the entities are annotated as relations with other entities described

in the text. So far, we have annotated 400 abstracts from the ACL

anthology and the ACM digital library. In this paper, the scheme

and the annotated dataset are described, along with the problems

found in the course of annotation. We also show the results of

automatic annotation and evaluate the corpus in a practical setting

in application to topic extraction.

Enriching TimeBank: Towards a more precise annotation of temporal relations in a text

Volker Gast, Lennart Bierkandt, Stephan Druskat and Christoph Rzymski

We propose a way of enriching the TimeML annotations of

TimeBank by adding information about the Topic Time in

terms of Klein (1994). The annotations are partly automatic,

partly inferential and partly manual. The corpus was converted

into the native format of the annotation software GraphAnno

and POS-tagged using the Stanford bidirectional dependency

network tagger. On top of each finite verb, a FIN-node

with tense information was created, and on top of any FIN-

node, a TOPICTIME-node, in accordance with Klein’s (1994)

treatment of finiteness as the linguistic correlate of the Topic

Time. Each TOPICTIME-node is linked to a MAKEINSTANCE-

node representing an (instantiated) event in TimeML (Pustejovsky

et al. 2005), the markup language used for the annotation

of TimeBank. For such links we introduce a new category,

ELINK. ELINKs capture the relationship between the Topic Time

(TT) and the Time of Situation (TSit) and have an aspectual

interpretation in Klein’s (1994) theory. In addition to these

automatic and inferential annotations, some TLINKs were added

manually. Using an example from the corpus, we show that the

inclusion of the Topic Time in the annotations allows for a richer

representation of the temporal structure than does TimeML. A

way of representing this structure in a diagrammatic form similar

to the T-Box format (Verhagen, 2007) is proposed.

P48 - Speech Processing (2)

Friday, May 27, 9:45

Chairperson: Denise DiPersio Poster Session

How Diachronic Text Corpora Affect Context-based Retrieval of OOV Proper Names for Audio News

Imran Sheikh, Irina Illina and Dominique Fohr

Out-Of-Vocabulary (OOV) words missed by Large Vocabulary

Continuous Speech Recognition (LVCSR) systems can be

recovered with the help of topic and semantic context of the OOV

words captured from a diachronic text corpus. In this paper we

investigate how the choice of documents for the diachronic text

corpora affects the retrieval of OOV Proper Names (PNs) relevant

to an audio document. We first present our diachronic French

broadcast news datasets, which highlight the motivation of our

study on OOV PNs. Then the effect of using diachronic text data

from different sources and a different time span is analysed. With

OOV PN retrieval experiments on French broadcast news videos,

we conclude that a diachronic corpus with text from different

sources leads to better retrieval performance than one relying on

text from a single source or from a longer time span.

Syllable based DNN-HMM Cantonese Speech to Text System

Timothy Wong, Claire Li, Sam Lam, Billy Chiu, Qin Lu, Minglei Li, Dan Xiong, Roy Shing Yu and Vincent T.Y. Ng

This paper reports our work on building up a Cantonese Speech-

to-Text (STT) system with a syllable based acoustic model. This

is a part of an effort in building a STT system to aid dyslexic

students who have cognitive deficiency in writing skills but have

no problem expressing their ideas through speech. For Cantonese

speech recognition, the basic unit of acoustic models can either be

the conventional Initial-Final (IF) syllables, or the Onset-Nucleus-

Coda (ONC) syllables where finals are further split into nucleus

and coda to reflect the intra-syllable variations in Cantonese. By

using the Kaldi toolkit, our system is trained using the stochastic

gradient descent optimization model with the aid of GPUs for the

hybrid Deep Neural Network and Hidden Markov Model (DNN-

HMM) with and without I-vector based speaker adaptive training

technique. The input features of the same Gaussian Mixture

Model with speaker adaptive training (GMM-SAT) to DNN are

used in all cases. Experiments show that the ONC-based syllable

acoustic modeling with I-vector based DNN-HMM achieves the

best performance with the word error rate (WER) of 9.66% and

the real time factor (RTF) of 1.38812.

Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: a Case Study of Wolof

Elodie Gauthier, Laurent Besacier, Sylvie Voisin, Michael Melese and Uriel Pascal Elingui

This article presents the data collected and ASR systems

developed for four sub-Saharan African languages (Swahili, Hausa,

Amharic and Wolof). To illustrate our methodology, the focus is

made on Wolof (a very under-resourced language) for which we

designed the first ASR system ever built for this language. All data

and scripts are available online on our github repository.

SCALE: A Scalable Language Engineering Toolkit

Joris Pelemans, Lyan Verwimp, Kris Demuynck, Hugo Van hamme and Patrick Wambacq

In this paper we present SCALE, a new Python toolkit that

contains two extensions to n-gram language models. The first

extension is a novel technique to model compound words called

Semantic Head Mapping (SHM). The second extension, Bag-of-

Words Language Modeling (BagLM), bundles popular models

such as Latent Semantic Analysis and Continuous Skip-grams.

Both extensions scale to large data and allow the integration into

first-pass ASR decoding. The toolkit is open source, includes

working examples and can be found on http://github.com/jorispelemans/scale.

Combining Manual and Automatic Prosodic Annotation for Expressive Speech Synthesis

Sandrine Brognaux, Thomas Francois and Marco Saerens

Text-to-speech has long been centered on the production of an

intelligible message of good quality. More recently, interest has

shifted to the generation of more natural and expressive speech.

A major issue of existing approaches is that they usually rely on

a manual annotation in expressive styles, which tends to be rather

subjective. A typical related issue is that the annotation is strongly

influenced – and possibly biased – by the semantic content of the

text (e.g. a shot or a fault may incite the annotator to tag that

sequence as expressing a high degree of excitation, independently

of its acoustic realization). This paper investigates the assumption

that human annotation of basketball commentaries in excitation

levels can be automatically improved on the basis of acoustic

features. It presents two techniques for label correction exploiting

a Gaussian mixture and a proportional-odds logistic regression.

The automatically re-annotated corpus is then used to train HMM-

based expressive speech synthesizers, the performance of which

is assessed through subjective evaluations. The results indicate

that the automatic correction of the annotation with Gaussian

mixture helps to synthesize more contrasted excitation levels,

while preserving naturalness.

BAS Speech Science Web Services - an Update of Current Developments

Thomas Kisler, Uwe Reichel, Florian Schiel, Christoph Draxler, Bernhard Jackl and Nina Pörner

In 2012 the Bavarian Archive for Speech Signals started providing

some of its tools from the field of spoken language in the

form of Software as a Service (SaaS). This means users access

the processing functionality over a web browser and therefore

do not have to install complex software packages on a local

computer. Amongst others, these tools include segmentation

& labeling, grapheme-to-phoneme conversion, text alignment,

syllabification and metadata generation, where all but the last

are available for a variety of languages. Since its creation the

number of available services and the web interface have changed

considerably. We give an overview and a detailed description

of the system architecture, the available web services and their

functionality. Furthermore, we show how the number of files

processed over the system developed in the last four years.

SPA: Web-based Platform for easy Access to Speech Processing Modules

Fernando Batista, Pedro Curto, Isabel Trancoso, Alberto Abad, Jaime Ferreira, Eugénio Ribeiro, Helena Moniz, David Martins de Matos and Ricardo Ribeiro

This paper presents SPA, a web-based Speech Analytics platform

that integrates several speech processing modules and that makes

it possible to use them through the web. It was developed

with the aim of facilitating the usage of the modules, without

the need to know about software dependencies and specific

configurations. Apart from being accessed by a web-browser,

the platform also provides a REST API for easy integration

with other applications. The platform is flexible, scalable,

provides authentication for access restrictions, and was developed

taking into consideration the time and effort of providing new

services. The platform is still being improved, but it already

integrates a considerable number of audio and text processing

modules, including: Automatic transcription, speech disfluency

classification, emotion detection, dialog act recognition, age and

gender classification, non-nativeness detection, hyper-articulation

detection, and two external modules for

feature extraction and DTMF detection. This paper describes

the SPA architecture, presents the already integrated modules,

and provides a detailed description for the ones most recently

integrated.

Enhanced CORILGA: Introducing the Automatic Phonetic Alignment Tool for Continuous Speech

Roberto Seara, Marta Martinez, Rocio Varela, Carmen García Mateo, Elisa Fernandez Rei and Xose Luis Regueira

The “Corpus Oral Informatizado da Lingua Galega (CORILGA)”

project aims at building a corpus of oral language for Galician,

primarily designed to study the linguistic variation and change.

This project is currently under development and it is periodically

enriched with new contributions. The long-term goal is that

all the speech recordings will be enriched with phonetic,

syllabic, morphosyntactic, lexical and sentence ELAN-compliant

annotations. A way to speed up the process of annotation is

to use automatic speech-recognition-based tools tailored to the

application. Therefore, CORILGA repository has been enhanced

with an automatic alignment tool, available to the administrator

of the repository, that aligns speech with an orthographic

transcription. In the event that no transcription, or just a partial

one, is available, a speech recognizer for Galician is used to

generate word and phonetic segmentations. These recognized

outputs may contain errors that will have to be manually corrected

by the administrator. For assisting this task, the tool also provides

an ELAN tier with the confidence measure of each recognized

word. In this paper, after the description of the main facts of the

CORILGA corpus, the speech alignment and recognition tools are

described. Both have been developed using the Kaldi toolkit.

O41 - Discourse

Friday, May 27, 11:45

Chairperson: Justus Roux Oral Session

A Corpus of Argument Networks: Using Graph Properties to Analyse Divisive Issues

Barbara Konat, John Lawrence, Joonsuk Park, Katarzyna Budzynska and Chris Reed

Governments are increasingly utilising online platforms in order

to engage with, and ascertain the opinions of, their citizens. Whilst

policy makers could potentially benefit from such enormous

feedback from society, they first face the challenge of making

sense out of the large volumes of data produced. This creates a

demand for tools and technologies which will enable governments

to quickly and thoroughly digest the points being made and

to respond accordingly. By determining the argumentative and

dialogical structures contained within a debate, we are able

to determine the issues which are divisive and those which

attract agreement. This paper proposes a method of graph-based

analytics which uses properties of graphs representing networks

of arguments pro- & con- in order to automatically analyse issues

which divide citizens about new regulations. By future application

of the most recent advances in argument mining, the results

reported here will have a chance to scale up to enable sense-

making of the vast amount of feedback received from citizens on

directions that policy should take.

metaTED: a Corpus of Metadiscourse for Spoken Language

Rui Correia, Nuno Mamede, Jorge Baptista and Maxine Eskenazi

This paper describes metaTED – a freely available corpus

of metadiscursive acts in spoken language collected via

crowdsourcing. Metadiscursive acts were annotated on a set

of 180 randomly chosen TED talks in English, spanning

different speakers and topics. The taxonomy used for annotation

is composed of 16 categories, adapted from Adel (2010). This

adaptation takes into account both the material to annotate

and the setting in which the annotation task is performed.

The crowdsourcing setup is described, including considerations

regarding training and quality control. The collected data is

evaluated in terms of quantity of occurrences, inter-annotator

agreement, and annotation related measures (such as average time

on task and self-reported confidence). Results show different

levels of agreement among metadiscourse acts (α ∈ [0.15; 0.49]).

To further assess the collected material, a subset of the annotations

was submitted to expert appreciation, who validated which of

the marked occurrences truly correspond to instances of the

metadiscursive act at hand. Similarly to what happened with the

crowd, experts revealed different levels of agreement between

categories (α ∈ [0.18; 0.72]). The paper concludes with a

discussion on the applicability of metaTED with respect to each

of the 16 categories of metadiscourse.

PARC 3.0: A Corpus of Attribution Relations

Silvia Pareti

Quotation and opinion extraction, discourse and factuality have all

partly addressed the annotation and identification of Attribution

Relations. However, disjoint efforts have provided a partial and

partly inaccurate picture of attribution and generated small or

incomplete resources, thus limiting the applicability of machine

learning approaches. This paper presents PARC 3.0, a large

corpus fully annotated with Attribution Relations (ARs). The

annotation scheme was tested with an inter-annotator agreement

study showing satisfactory results for the identification of ARs and

high agreement on the selection of the text spans corresponding to

its constitutive elements: source, cue and content. The corpus,

which comprises around 20k ARs, was used to investigate the

range of structures that can express attribution. The results show a

complex and varied relation of which the literature has addressed

only a portion. PARC 3.0 is available for research use and can

be used in a range of different studies to analyse attribution and

validate assumptions as well as to develop supervised attribution

extraction models.

Improving the Annotation of Sentence Specificity

Junyi Jessy Li, Bridget O’Daniel, Yi Wu, Wenli Zhao andAni Nenkova

We introduce improved guidelines for annotation of sentence

specificity, addressing the issues encountered in prior work. Our

annotation provides judgements of sentences in context. Rather

than binary judgements, we introduce a specificity scale which

accommodates nuanced judgements. Our augmented annotation

procedure also allows us to define where in the discourse context

the lack of specificity can be resolved. In addition, the cause of the

underspecification is annotated in the form of free text questions.

We present results from a pilot annotation with this new scheme

and demonstrate good inter-annotator agreement. We found that

the lack of specificity distributes evenly among immediate prior

context, long distance prior context and no prior context. We find

that missing details that are not resolved in the prior context

are more likely to trigger questions about the reason behind events,

“why” and “how”. Our data is accessible at http://www.cis.upenn.edu/~nlp/corpora/lrec16spec.html

Focus Annotation of Task-based Data: A Comparison of Expert and Crowd-Sourced Annotation in a Reading Comprehension Corpus

Kordula De Kuthy, Ramon Ziai and Detmar Meurers

While the formal pragmatic concepts in information structure,

such as the focus of an utterance, are precisely defined in

theoretical linguistics and potentially very useful in conceptual

and practical terms, it has turned out to be difficult to reliably

annotate such notions in corpus data. We present a large-

scale focus annotation effort designed to overcome this problem.

Our annotation study is based on the task-based corpus

CREG, which consists of answers to explicitly given reading

comprehension questions. We compare focus annotation by

trained annotators with a crowd-sourcing setup making use of

untrained native speakers. Given the task context and an

annotation process incrementally making the question form and

answer type explicit, the trained annotators reach substantial

agreement for focus annotation. Interestingly, the crowd-sourcing

setup also supports high-quality annotation – for specific subtypes

of data. Finally, we turn to the question whether the relevance

of focus annotation can be extrinsically evaluated. We show

that automatic short-answer assessment significantly improves for

focus annotated data. The focus annotated CREG corpus is freely

available and constitutes the largest such resource for German.

O42 - Twitter Related Analysis

Friday, May 27, 11:45

Chairperson: Xavier Tannier Oral Session

Homing in on Twitter Users: Evaluating an Enhanced Geoparser for User Profile Locations

Beatrice Alex, Clare Llewellyn, Claire Grover, Jon Oberlander and Richard Tobin

Twitter-related studies often need to geo-locate Tweets or Twitter

users, identifying their real-world geographic locations. As tweet-

level geotagging remains rare, most prior work exploited tweet

content, timezone and network information to inform geolocation,

or else relied on off-the-shelf tools to geolocate users from

location information in their user profiles. However, such user

location metadata is not consistently structured, causing such

tools to fail regularly, especially if a string contains multiple

locations, or if locations are very fine-grained. We argue that

user profile location (UPL) and tweet location need to be treated

as distinct types of information from which differing inferences

can be drawn. Here, we apply geoparsing to UPLs, and

demonstrate how task performance can be improved by adapting

our Edinburgh Geoparser, which was originally developed for

processing English text. We present a detailed evaluation method

and results, including inter-coder agreement. We demonstrate

that the optimised geoparser can effectively extract and geo-

reference multiple locations at different levels of granularity with

an F1-score of around 0.90. We also illustrate how geoparsed

UPLs can be exploited for international information trade studies

and country-level sentiment analysis.

A Dataset for Detecting Stance in Tweets

Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani,Xiaodan Zhu and Colin Cherry

We can often detect from a person’s utterances whether he/she

is in favor of or against a given target entity (a product, topic,

another person, etc.). Here for the first time we present a dataset

of tweets annotated for whether the tweeter is in favor of or against

pre-chosen targets of interest–their stance. The targets of interest

may or may not be referred to in the tweets, and they may or

may not be the target of opinion in the tweets. The data pertains

to six targets of interest commonly known and debated in the

United States. Apart from stance, the tweets are also annotated for

whether the target of interest is the target of opinion in the tweet.

The annotations were performed by crowdsourcing. Several

techniques were employed to encourage high-quality annotations

(for example, providing clear and simple instructions) and to

identify and discard poor annotations (for example, using a small

set of check questions annotated by the authors). This Stance

Dataset, which was subsequently also annotated for sentiment,

can be used to better understand the relationship between stance,

sentiment, entity relationships, and textual inference.

Emotion Analysis on Twitter: The Hidden Challenge

Luca Dini and André Bittar

In this paper, we present an experiment to detect emotions in

tweets. Unlike much previous research, we draw the important

distinction between the tasks of emotion detection in a closed

world assumption (i.e. every tweet is emotional) and the

complicated task of identifying emotional versus non-emotional

tweets. Given an apparent lack of appropriately annotated data, we

created two corpora for these tasks. We describe two systems, one

symbolic and one based on machine learning, which we evaluated

on our datasets. Our evaluation shows that a machine learning

classifier performs best on emotion detection, while a symbolic

approach is better for identifying relevant (i.e. emotional) tweets.

Crowdsourcing Salient Information from News and Tweets

Oana Inel, Tommaso Caselli and Lora Aroyo

The increasing streams of information pose challenges to both

humans and machines. On the one hand, humans need to identify

relevant information and consume only the information that lies at

their interests. On the other hand, machines need to understand

the information that is published in online data streams and

generate concise and meaningful overviews. We consider events

as prime factors to query for information and generate meaningful

context. The focus of this paper is to acquire empirical insights

for identifying salience features in tweets and news about a target

event, i.e., the event of “whaling”. We first derive a methodology

to identify such features by building up a knowledge space of the

event enriched with relevant phrases, sentiments and ranked by

their novelty. We applied this methodology on tweets and we

have performed preliminary work towards adapting it to news

articles. Our results show that crowdsourcing text relevance,

sentiments and novelty (1) can be a main step in identifying

salient information, and (2) provides a deeper and more precise

understanding of the data at hand compared to state-of-the-art

approaches.

What does this Emoji Mean? A Vector Space Skip-Gram Model for Twitter Emojis

Francesco Barbieri, Francesco Ronzano and Horacio Saggion

Emojis allow us to describe objects, situations and even feelings

with small images, providing a visual and quick way to

communicate. In this paper, we analyse emojis used in Twitter

with distributional semantic models. We retrieve 10 million

tweets posted by USA users, and we build several skip gram

word embedding models by mapping in the same vectorial space

both words and emojis. We test our models with semantic

similarity experiments, comparing the output of our models with

human assessment. We also carry out an exhaustive qualitative

evaluation, showing interesting results.
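
A minimal sketch of the general approach: train a skip-gram model over token lists in which emojis are kept as ordinary tokens, then query the shared vector space. The toy data and gensim hyperparameters are illustrative, not the authors' setup.

```python
from gensim.models import Word2Vec

# Toy token lists standing in for preprocessed tweets; emojis are kept as ordinary tokens.
tokenized_tweets = [
    ["happy", "birthday", "🎉"],
    ["party", "tonight", "🎉"],
    ["great", "goal", "⚽"],
    ["match", "tonight", "⚽"],
]

# Skip-gram model (sg=1) mapping words and emojis into the same vector space.
model = Word2Vec(sentences=tokenized_tweets, sg=1, window=5, min_count=1, seed=1)

# Nearest neighbours of an emoji among both words and other emojis.
print(model.wv.most_similar("🎉", topn=5))
```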

O43 - Semantics

Friday, May 27, 11:45

Chairperson: James Pustejovsky Oral Session

Crossmodal Network-Based Distributional Semantic Models

Elias Iosif and Alexandros Potamianos

Despite the recent success of distributional semantic models

(DSMs) in various semantic tasks, they remain disconnected from

real-world perceptual cues since they typically rely on linguistic

features. Text data constitute the dominant source of features

for the majority of such models, although there is evidence from

cognitive science that cues from other modalities contribute to

the acquisition and representation of semantic knowledge. In

this work, we propose the crossmodal extension of a two-tier

text-based model, where semantic representations are encoded

in the first layer, while the second layer is used for computing

similarity between words. We exploit text- and image-derived

features for performing computations at each layer, as well as

various approaches for their crossmodal fusion. It is shown

that the crossmodal model performs better (from 0.68 to 0.71

correlation coefficient) than the unimodal one for the task of

similarity computation between words.
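
One simple way to realise crossmodal fusion at the similarity layer is a weighted combination of modality-specific similarities; the late-fusion scheme and weight below are illustrative and not necessarily the fusion used in the paper.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def crossmodal_similarity(w1, w2, text_vecs, image_vecs, alpha=0.5):
    """Late fusion of similarity scores from two modalities.
    text_vecs / image_vecs: dicts mapping words to text- and image-derived vectors.
    alpha weights the text modality; 1 - alpha weights the image modality (illustrative)."""
    sim_text = cosine(text_vecs[w1], text_vecs[w2])
    sim_image = cosine(image_vecs[w1], image_vecs[w2])
    return alpha * sim_text + (1 - alpha) * sim_image
```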

Comprehensive and Consistent PropBank Light Verb Annotation

Claire Bonial and Martha Palmer

Recent efforts have focused on expanding the annotation coverage

of PropBank from verb relations to adjective and noun relations, as

well as light verb constructions (e.g., make an offer, take a bath).

While each new relation type has presented unique annotation

challenges, ensuring consistent and comprehensive annotation

of light verb constructions has proved particularly challenging,

given that light verb constructions are semi-productive, difficult

to define, and there are often borderline cases. This research

describes the iterative process of developing PropBank annotation

guidelines for light verb constructions, the current guidelines, and

a comparison to related resources.

Inconsistency Detection in Semantic Annotation

Nora Hollenstein, Nathan Schneider and Bonnie Webber

Inconsistencies are part of any manually annotated corpus.

Automatically finding these inconsistencies and correcting them

(even manually) can increase the quality of the data. Past research

has focused mainly on detecting inconsistency in syntactic

annotation. This work explores new approaches to detecting

inconsistency in semantic annotation. Two ranking methods

are presented in this paper: a discrepancy ranking and an

entropy ranking. Those methods are then tested and evaluated

on multiple corpora annotated with multiword expressions and

supersense labels. The results show considerable improvements

in detecting inconsistency candidates over a random baseline.

Possible applications of methods for inconsistency detection are

improving the annotation procedure as well as the guidelines and

correcting errors in completed annotations.
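
The entropy ranking can be read as scoring each annotated type by the entropy of its label distribution, so that types labelled inconsistently float to the top; a minimal sketch of that reading, not the paper's exact scoring.

```python
import math
from collections import Counter, defaultdict

def entropy_ranking(annotations):
    """annotations: list of (item_type, label) pairs, e.g. the same multiword expression
    annotated with different supersense labels across a corpus.
    Returns item types sorted by the entropy of their label distribution (highest first)."""
    labels_by_type = defaultdict(list)
    for item_type, label in annotations:
        labels_by_type[item_type].append(label)
    scores = {}
    for item_type, labels in labels_by_type.items():
        counts = Counter(labels)
        total = len(labels)
        scores[item_type] = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# e.g. entropy_ranking([("hot dog", "FOOD"), ("hot dog", "FOOD"), ("hot dog", "ANIMAL")])
```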

Towards Comparability of Linguistic Graph Banks for Semantic Parsing

Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinkova, Dan Flickinger, Jan Hajic, Angelina Ivanova and Zdenka Uresova

We announce a new language resource for research on semantic

parsing, a large, carefully curated collection of semantic

dependency graphs representing multiple linguistic traditions.

This resource is called SDP 2016 and provides an update and

extension to previous versions used as Semantic Dependency

Parsing target representations in the 2014 and 2015 Semantic

Evaluation Exercises. For a common core of English text, this

third edition comprises semantic dependency graphs from four

distinct frameworks, packaged in a unified abstract format and

aligned at the sentence and token levels. SDP 2016 is the

first general release of this resource and available for licensing

from the Linguistic Data Consortium in May 2016. The data is

accompanied by an open-source SDP utility toolkit and system

results from previous contrastive parsing evaluations against these

target representations.

Event Coreference Resolution with Multi-Pass Sieves

Jing Lu and Vincent Ng

Multi-pass sieve approaches have been successfully applied to

entity coreference resolution and many other tasks in natural

language processing (NLP), owing in part to the ease of designing

high-precision rules for these tasks. However, the same is not

true for event coreference resolution: typically lying towards

the end of the standard information extraction pipeline, an

event coreference resolver assumes as input the noisy outputs

of its upstream components such as the trigger identification

component and the entity coreference resolution component. The

difficulty in designing high-precision rules makes it challenging

to successfully apply a multi-pass sieve approach to event

coreference resolution. In this paper, we investigate this challenge,

proposing the first multi-pass sieve approach to event coreference

resolution. When evaluated on the version of the KBP 2015

corpus available to the participants of EN Task 2 (Event Nugget

Detection and Coreference), our approach achieves an Avg F-

score of 40.32%, outperforming the best participating system by

0.67% in Avg F-score.
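As a rough illustration of the general multi-pass sieve idea (not the specific sieves proposed in the paper), each mention starts in its own cluster and precision-ordered sieves successively merge clusters; the attribute names used by the toy sieve below are hypothetical.

```python
def multi_pass_sieve(mentions, sieves):
    """Generic multi-pass sieve scaffold: each mention starts in its own
    cluster; sieves are applied from highest to lowest precision, and each
    sieve may merge the clusters it judges coreferent."""
    clusters = [[m] for m in mentions]
    for sieve in sieves:
        merged = True
        while merged:
            merged = False
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    if sieve(clusters[i], clusters[j]):
                        clusters[i].extend(clusters[j])
                        del clusters[j]
                        merged = True
                        break
                if merged:
                    break
    return clusters

# Illustrative high-precision sieve: merge event mentions with the same
# trigger lemma and event type (hypothetical attribute names).
def same_trigger_and_type(c1, c2):
    return any(a["trigger"] == b["trigger"] and a["type"] == b["type"]
               for a in c1 for b in c2)

events = [{"trigger": "attack", "type": "Conflict"},
          {"trigger": "attack", "type": "Conflict"},
          {"trigger": "meet", "type": "Contact"}]
print(multi_pass_sieve(events, [same_trigger_and_type]))
```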

O44 - Speech Resources, Oral Session
Friday, May 27, 11:45

Chairperson: Sophie Rosset

Endangered Language Documentation: Bootstrapping a Chatino Speech Corpus, Forced Aligner, ASR

Malgorzata Cavar, Damir Cavar and Hilaria Cruz

This project approaches the problem of language documentation

and revitalization from a rather untraditional angle. To improve

and facilitate language documentation of endangered languages,

we attempt to use corpus linguistic methods and speech and

language technologies to reduce the time needed for transcription

and annotation of audio and video language recordings. The paper

demonstrates this approach on the example of the endangered and

seriously under-resourced variety of Eastern Chatino (CTP). We

show how initial speech corpora can be created that can facilitate

the development of speech and language technologies for under-

resourced languages by utilizing Forced Alignment tools to time

align transcriptions. Time-aligned transcriptions can be used to

train speech corpora and utilize automatic speech recognition tools

for the transcription and annotation of untranscribed data. Speech

technologies can be used to reduce the time and effort necessary

for transcription and annotation of large collections of audio

and video recordings in digital language archives, addressing the

transcription bottleneck problem that most language archives and

many under-documented languages are confronted with. This

approach can increase the availability of language resources from

low-resourced and endangered languages to speech and language

technology research and development.

The DIRHA Portuguese Corpus: A Comparison of Home Automation Command Detection and Recognition in Simulated and Real Data

Miguel Matos, Alberto Abad and António Serralheiro

In this paper, we describe a new corpus, named DIRHA-L2F RealCorpus, composed of typical home automation speech

interactions in European Portuguese that has been recorded by

the INESC-ID’s Spoken Language Systems Laboratory (L2F) to

support the activities of the Distant-speech Interaction for Robust

Home Applications (DIRHA) EU-funded project. The corpus is

a multi-microphone and multi-room database of real continuous

audio sequences containing read phonetically rich sentences, read

and spontaneous keyword activation sentences, and read and


spontaneous home automation commands. The background noise

conditions are controlled and randomly recreated with noises

typically found in home environments. Experimental validation

on this corpus is reported in comparison with the results obtained

on a simulated corpus using a fully automated speech processing

pipeline for two fundamental automatic speech recognition tasks

of typical ’always-listening’ home-automation scenarios: system

activation and voice command recognition. Considering the results on both corpora, the presence of overlapping voice-like noise emerges as the main problem: simulated sequences contain concurrent speakers, which in general makes that corpus more challenging, while performance on real sequences drops drastically when a TV or radio is on.

Accuracy of Automatic Cross-Corpus Emotion Labeling for Conversational Speech Corpus Commonization

Hiroki Mori, Atsushi Nagaoka and Yoshiko Arimoto

There exists a major incompatibility in emotion labeling

framework among emotional speech corpora, that is, category-

based and dimension-based. Commonizing these requires inter-

corpus emotion labeling according to both frameworks, but

doing this by human annotators is too costly for most cases.

This paper examines the possibility of automatic cross-corpus

emotion labeling. In order to evaluate the effectiveness of

the automatic labeling, a comprehensive emotion annotation for

two conversational corpora, UUDB and OGVC, was performed.

With a state-of-the-art machine learning technique, dimensional

and categorical emotion estimation models were trained and

tested against the two corpora. For the emotion dimension

estimation, the automatic cross-corpus emotion labeling for the

different corpus was effective for the dimensions of aroused-

sleepy, dominant-submissive and interested-indifferent, showing

only slight performance degradation against the result for the same

corpus. On the other hand, the performance for the emotion

category estimation was not sufficient.

English-to-Japanese Translation vs. Dictation vs. Post-editing: Comparing Translation Modes in a Multilingual Setting

Michael Carl, Akiko Aizawa and Masaru Yamada

Speech-enabled interfaces have the potential to become one of the

most efficient and ergonomic environments for human-computer

interaction and for text production. However, not much research

has been carried out to investigate in detail the processes and

strategies involved in the different modes of text production.

This paper introduces and evaluates a corpus of more than 55

hours of English-to-Japanese user activity data that were collected

within the ENJA15 project, in which translators were observed

while writing and speaking translations (translation dictation) and

during machine translation post-editing. The transcription of

the spoken data, keyboard logging and eye-tracking data were

recorded with Translog-II, post-processed and integrated into the

CRITT Translation Process Research-DB (TPR-DB), which is

publicly available under a creative commons license. The paper

presents the ENJA15 data as part of a large multilingual Chinese,

Danish, German, Hindi and Spanish translation process data

collection of more than 760 translation sessions. It compares the

ENJA15 data with the other language pairs and reviews some of

its particularities.

Database of Mandarin Neighborhood Statistics

Karl Neergaard, Hongzhi Xu and Chu-Ren Huang

In the design of controlled experiments with language stimuli,

researchers from psycholinguistic, neurolinguistic, and related

fields, require language resources that isolate variables known

to affect language processing. This article describes a

freely available database that provides word level statistics for

words and nonwords of Mandarin Chinese. The featured

lexical statistics include subtitle corpus frequency, phonological

neighborhood density, neighborhood frequency, and homophone

density. The accompanying word descriptors include pinyin,

ASCII phonetic transcription (SAMPA), lexical tone, syllable

structure, dominant PoS, and syllable, segment and pinyin

lengths for each phonological word. It is designed for

researchers particularly concerned with language processing of

isolated words and made to accommodate multiple existing

hypotheses concerning the structure of the Mandarin syllable.

The database is divided into multiple files according to the

desired search criteria: 1) the syllable segmentation schema

used to calculate density measures, and 2) whether the search is

for words or nonwords. The database is open to the research

community at https://github.com/karlneergaard/Mandarin-Neighborhood-Statistics.

P49 - Corpus Creation and Querying (2), Poster Session
Friday, May 27, 11:45

Chairperson: Menzo Windhouwer

TEITOK: Text-Faithful Annotated Corpora

Maarten Janssen

TEITOK is a web-based framework for corpus creation,

annotation, and distribution, that combines textual and linguistic

annotation within a single TEI based XML document. TEITOK

provides several built-in NLP tools to automatically (pre)process


texts, and is highly customizable. It features multiple orthographic

transcription layers, and a wide range of user-defined token-based

annotations. For searching, TEITOK interfaces with a local CQP

server. TEITOK can handle various types of additional resources

including Facsimile images and linked audio files, making it

possible to have a combined written/spoken corpus. It also has

additional modules for PSDX syntactic annotation and several

types of stand-off annotation.

Extracting Interlinear Glossed Text from LaTeX Documents

Mathias Schenner and Sebastian Nordhoff

We present texigt, a command-line tool for the extraction of

structured linguistic data from LaTeX source documents, and a

language resource that has been generated using this tool: a corpus

of interlinear glossed text (IGT) extracted from open access books

published by Language Science Press. Extracted examples are

represented in a simple XML format that is easy to process and

can be used to validate certain aspects of interlinear glossed text.

The main challenge involved is the parsing of TeX and LaTeX

documents. We review why this task is impossible in general

and how the texhs Haskell library uses a layered architecture and

selective early evaluation (expansion) during lexing and parsing

in order to provide access to structured representations of LaTeX

documents at several levels. In particular, its parsing modules

generate an abstract syntax tree for LaTeX documents after

expansion of all user-defined macros and lexer-level commands

that serves as an ideal interface for the extraction of interlinear

glossed text by texigt. This architecture can easily be adapted to

extract other types of linguistic data structures from LaTeX source

documents.

Interoperability of Annotation Schemes: Using the Pepper Framework to Display AWA Documents in the ANNIS Interface

Talvany Carlotto, Zuhaitz Beloki, Xabier Artola and Aitor Soroa

Natural language processing applications are frequently integrated

to solve complex linguistic problems, but the lack of

interoperability between these tools tends to be one of the

main issues found in that process. That is often caused by

the different linguistic formats used across the applications,

which leads to attempts to both establish standard formats to

represent linguistic information and to create conversion tools

to facilitate this integration. Pepper is an example of the latter,

as a framework that helps the conversion between different

linguistic annotation formats. In this paper, we describe the

use of Pepper to convert a corpus linguistically annotated by

the annotation scheme AWA into the relANNIS format, with the

ultimate goal of interacting with AWA documents through the

ANNIS interface. The experiment converted 40 megabytes of

AWA documents, allowed their use on the ANNIS interface, and

involved making architectural decisions during the mapping from

AWA into relANNIS using Pepper. The main issues faced during

this process were due to technical issues mainly caused by the

integration of the different systems and projects, namely AWA,

Pepper and ANNIS.

SPLIT: Smart Preprocessing (Quasi) Language Independent Tool

Mohamed Al-Badrashiny, Arfath Pasha, Mona Diab, Nizar Habash, Owen Rambow, Wael Salloum and Ramy Eskander

Text preprocessing is an important and necessary task for all NLP

applications. A simple variation in any preprocessing step may

drastically affect the final results. Moreover, replicability and comparability, as much as feasible, are among the goals of our scientific enterprise; thus, building systems that can ensure consistency across our various pipelines would contribute significantly

to our goals. The problem has become quite pronounced with the

abundance of NLP tools becoming more and more available yet

with different levels of specifications. In this paper, we present

a dynamic unified preprocessing framework and tool, SPLIT,

that is highly configurable based on user requirements which

serves as a preprocessing tool for several tools at once. SPLIT

aims to standardize the implementations of the most important

preprocessing steps by allowing for a unified API that could

be exchanged across different researchers to ensure complete

transparency in replication. The user is able to select the required

preprocessing tasks among a long list of preprocessing steps. The

user is also able to specify the order of execution which in turn

affects the final preprocessing output.
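A minimal sketch of the kind of configurable pipeline described above: named preprocessing steps applied in a user-specified order. The step names and API below are our own illustration and do not reflect SPLIT's actual interface.

```python
import re

# A few toy preprocessing steps; the real SPLIT steps and names differ.
STEPS = {
    "lowercase": lambda text: text.lower(),
    "strip_punct": lambda text: re.sub(r"[^\w\s]", "", text),
    "normalize_ws": lambda text: re.sub(r"\s+", " ", text).strip(),
}

def run_pipeline(text, step_names):
    """Apply the configured steps in the user-specified order."""
    for name in step_names:
        text = STEPS[name](text)
    return text

# The order matters: stripping punctuation before or after lowercasing can
# change the final preprocessing output that downstream tools receive.
print(run_pipeline("  Hello,   WORLD!! ", ["strip_punct", "lowercase", "normalize_ws"]))
```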

ArchiMob - A Corpus of Spoken Swiss German

Tanja Samardzic, Yves Scherrer and Elvira Glaser

Swiss dialects of German are, unlike most dialects of well

standardised languages, widely used in everyday communication.

Despite this, automatic processing of Swiss German is still a considerable challenge because it is mostly a spoken variety, rarely recorded, and subject to considerable

regional variation. This paper presents a freely available general-

purpose corpus of spoken Swiss German suitable for linguistic

research, but also for training automatic tools. The corpus is

a result of a long design process, intensive manual work and


specially adapted computational processing. We first describe how

the documents were transcribed, segmented and aligned with the

sound source, and how inconsistent transcriptions were unified

through an additional normalisation layer. We then present a

bootstrapping approach to automatic normalisation using different

machine-translation-inspired methods. Furthermore, we evaluate

the performance of part-of-speech taggers on our data and show

how the same bootstrapping approach improves part-of-speech

tagging by 10% over four rounds. Finally, we present the

modalities of access of the corpus as well as the data format.

Word Segmentation for Akkadian Cuneiform

Timo Homburg and Christian Chiarcos

We present experiments on word segmentation for Akkadian

cuneiform, an ancient writing system and a language used for

about 3 millennia in the ancient Near East. To our best knowledge,

this is the first study of this kind applied to either the Akkadian

language or the cuneiform writing system. As a logosyllabic

writing system, cuneiform structurally resembles Eastern Asian

writing systems, so, we employ word segmentation algorithms

originally developed for Chinese and Japanese. We describe

results of rule-based algorithms, dictionary-based algorithms,

statistical and machine learning approaches. Our results may

indicate promising directions for cuneiform word segmentation that can help establish and improve natural language processing in this

area.
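One family of the dictionary-based algorithms mentioned above, forward maximum matching as commonly used for Chinese and Japanese, is easy to sketch over transliterated sign sequences; the toy lexicon below is invented for illustration.

```python
def forward_maximum_matching(signs, lexicon, max_len=5):
    """Greedy left-to-right segmentation: at each position, take the longest
    sequence of signs that appears in the lexicon, falling back to a single
    sign when nothing matches."""
    words, i = [], 0
    while i < len(signs):
        for length in range(min(max_len, len(signs) - i), 0, -1):
            candidate = tuple(signs[i:i + length])
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

# Toy transliterated sign sequence and lexicon (purely illustrative).
lexicon = {("szar", "ru"), ("a", "na")}
print(forward_maximum_matching(["a", "na", "szar", "ru", "um"], lexicon))
```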

Controlled Propagation of Concept Annotations in Textual Corpora

Cyril Grouin

In this paper, we presented the annotation propagation tool

we designed to be used in conjunction with the BRAT rapid

annotation tool. We designed two experiments to annotate a

corpus of 60 files, first not using our tool, second using our

propagation tool. We evaluated the annotation time and the quality

of annotations. We show that using the annotation propagation tool reduces the time spent annotating the corpus by 31.7% while yielding better-quality annotations.

Graphical Annotation for Syntax-Semantics Mapping

Koiti Hasida

A potential work item (PWI) for ISO standard (MAP) about

linguistic annotation concerning syntax-semantics mapping is

discussed. MAP is a framework for graphical linguistic annotation

to specify a mapping (set of combinations) between possible

syntactic and semantic structures of the annotated linguistic

data. Just like a UML diagram, a MAP diagram is formal,

in the sense that it accurately specifies such a mapping. MAP

provides a diagrammatic sort of concrete syntax for linguistic

annotation far easier to understand than textual concrete syntax

such as in XML, so that it could better facilitate collaborations

among people involved in research, standardization, and practical

use of linguistic data. MAP deals with syntactic structures

including dependencies, coordinations, ellipses, transsentential

constructions, and so on. Semantic structures treated by MAP

are argument structures, scopes, coreferences, anaphora, discourse

relations, dialogue acts, and so forth. In order to simplify explicit

annotations, MAP allows partial descriptions, and assumes a few

general rules on correspondence between syntactic and semantic

compositions.

EDISON: Feature Extraction for NLP, Simplified

Mark Sammons, Christos Christodoulopoulos, Parisa Kordjamshidi, Daniel Khashabi, Vivek Srikumar and Dan Roth

When designing Natural Language Processing (NLP) applications

that use Machine Learning (ML) techniques, feature extraction

becomes a significant part of the development effort, whether

developing a new application or attempting to reproduce results

reported for existing NLP tasks. We present EDISON, a Java

library of feature generation functions used in a suite of state-of-

the-art NLP tools, based on a set of generic NLP data structures.

These feature extractors populate simple data structures encoding

the extracted features, which the package can also serialize to

an intuitive JSON file format that can be easily mapped to

formats used by ML packages. EDISON can also be used

programmatically with JVM-based (Java/Scala) NLP software to

provide the feature extractor input. The collection of feature

extractors is organised hierarchically and a simple search interface

is provided. In this paper we include examples that demonstrate

the versatility and ease-of-use of the EDISON feature extraction

suite to show that this can significantly reduce the time spent

by developers on feature extraction design for NLP systems.

The library is publicly hosted at https://github.com/IllinoisCogComp/illinois-cogcomp-nlp/, and we

hope that other NLP researchers will contribute to the set of

feature extractors. In this way, the community can help simplify

reproduction of published results and the integration of ideas

from diverse sources when developing new and improved NLP

applications.
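A hedged sketch of the general design described above: named feature-generation functions applied over a minimal token data structure, with the extracted features serialisable to JSON. The function and field names are illustrative and do not correspond to EDISON's Java API.

```python
import json

# A registry of named feature extractors over a minimal token-level data
# structure; the names and structure are illustrative only.
EXTRACTORS = {}

def extractor(name):
    def register(fn):
        EXTRACTORS[name] = fn
        return fn
    return register

@extractor("word_shape")
def word_shape(token):
    # Map uppercase letters to X, lowercase to x, digits to d.
    return "".join("X" if c.isupper() else "x" if c.isalpha()
                   else "d" if c.isdigit() else c
                   for c in token["form"])

@extractor("suffix3")
def suffix3(token):
    return token["form"][-3:]

def extract_features(tokens, names):
    """Apply the selected extractors to every token and return a JSON-ready
    structure, similar in spirit to the serialisation described above."""
    return [{name: EXTRACTORS[name](tok) for name in names} for tok in tokens]

tokens = [{"form": "Portorož"}, {"form": "2016"}]
print(json.dumps(extract_features(tokens, ["word_shape", "suffix3"]), ensure_ascii=False))
```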


P50 - Document Classification and Text Categorisation (2), Poster Session
Friday, May 27, 11:45

Chairperson: Thierry Hamon

MADAD: A Readability Annotation Tool for Arabic Text

Nora Al-Twairesh, Abeer Al-Dayel, Hend Al-Khalifa, Maha Al-Yahya, Sinaa Alageel, Nora Abanmy and Nouf AlShenaifi

This paper introduces MADAD, a general-purpose annotation

tool for Arabic text with focus on readability annotation. This

tool will help in overcoming the problem of lack of Arabic

readability training data by providing an online environment to

collect readability assessments on various kinds of corpora. Also

the tool supports a broad range of annotation tasks for various

linguistic and semantic phenomena by allowing users to create

their customized annotation schemes. MADAD is a web-based

tool, accessible through any web browser; the main features that

distinguish MADAD are its flexibility, portability, customizability

and its bilingual interface (Arabic/English).

Modeling Language Change in Historical Corpora: The Case of Portuguese

Marcos Zampieri, Shervin Malmasi and Mark Dras

This paper presents a number of experiments to model changes in

a historical Portuguese corpus composed of literary texts for the

purpose of temporal text classification. Algorithms were trained

to classify texts with respect to their publication date taking

into account lexical variation represented as word n-grams, and

morphosyntactic variation represented by part-of-speech (POS)

distribution. We report results of 99.8% accuracy using word

unigram features with a Support Vector Machines classifier to

predict the publication date of documents in time intervals of both

one century and half a century. A feature analysis is performed

to investigate the most informative features for this task and how

they are linked to language change.
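The reported setup, word unigram features feeding a Support Vector Machines classifier that predicts a document's publication period, corresponds to a standard text-classification pipeline; below is a minimal sketch with invented toy data, assuming scikit-learn is available.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy documents labelled with an (invented) half-century of publication.
docs = ["vossa mercê haveis de saber", "você vai gostar do livro",
        "haveis de me perdoar, senhor", "o livro que você comprou"]
periods = ["1850-1900", "1950-2000", "1850-1900", "1950-2000"]

# Word unigram features feeding a linear SVM, as in the setup described above.
model = make_pipeline(CountVectorizer(ngram_range=(1, 1)), LinearSVC())
model.fit(docs, periods)
print(model.predict(["haveis de saber, senhor"]))
```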

“He Said She Said” – a Male/Female Corpus of Polish

Filip Gralinski, Łukasz Borchmann and Piotr Wierzchon

Gender differences in language use have long been of interest

in linguistics. The task of automatic gender attribution has

been considered in computational linguistics as well. Most

research of this type is done using (usually English) texts with

authorship metadata. In this paper, we propose a new method

of male/female corpus creation based on gender-specific first-

person expressions. The method was applied to the CommonCrawl Web corpus for Polish (a language in which gender-revealing first-person expressions are particularly frequent) to yield a large

(780M words) and varied collection of men’s and women’s texts.

The whole procedure for building the corpus and filtering out

unwanted texts is described in the present paper. The quality

check was done on a random sample of the corpus to make sure

that the majority (84%) of texts are correctly attributed, natural

texts. Some preliminary (socio)linguistic insights (websites and

words frequently occurring in male/female fragments) are given

as well.
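The central idea, gender-revealing first-person expressions, can be illustrated with Polish past-tense first-person singular endings (-łem for masculine and -łam for feminine speakers). The patterns below are a deliberately rough sketch, not the filters actually used to build the corpus.

```python
import re

# Rough, illustrative patterns: Polish 1sg past-tense verbs end in -łem
# (masculine) or -łam (feminine), e.g. "zrobiłem" vs. "zrobiłam".
MASC = re.compile(r"\b\w+łem\b")
FEM = re.compile(r"\b\w+łam\b")

def guess_author_gender(text):
    """Very crude gender attribution by counting gender-revealing
    first-person forms; returns None when the evidence is mixed or absent."""
    masc, fem = len(MASC.findall(text)), len(FEM.findall(text))
    if masc > fem:
        return "male"
    if fem > masc:
        return "female"
    return None

print(guess_author_gender("Wczoraj poszłam do kina i kupiłam bilet."))  # female
print(guess_author_gender("Zrobiłem zakupy i wróciłem do domu."))       # male
```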

Cohere: A Toolkit for Local Coherence

Karin Sim Smith, Wilker Aziz and Lucia Specia

We describe COHERE, our coherence toolkit which incorporates

various complementary models for capturing and measuring

different aspects of text coherence. In addition to the traditional

entity grid model (Lapata, 2005) and graph-based metric

(Guinaudeau and Strube, 2013), we provide an implementation of

a state-of-the-art syntax-based model (Louis and Nenkova, 2012),

as well as an adaptation of this model which shows significant

performance improvements in our experiments. We benchmark

these models using the standard setting for text coherence:

original documents and versions of the document with sentences

in shuffled order.
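The benchmark setting mentioned above can be sketched generically: a coherence model should score the original sentence order above a shuffled permutation of the same document. The `score` function is a placeholder for any of the toolkit's models; the toy scorer is our own illustration.

```python
import random

def shuffle_test(documents, score, n_permutations=20, seed=0):
    """Fraction of (document, permutation) pairs in which the coherence
    `score` prefers the original sentence order over a shuffled one."""
    rng = random.Random(seed)
    wins, total = 0, 0
    for sentences in documents:
        for _ in range(n_permutations):
            shuffled = sentences[:]
            rng.shuffle(shuffled)
            if shuffled == sentences:
                continue
            total += 1
            wins += score(sentences) > score(shuffled)
    return wins / total if total else 0.0

# Toy "model": reward adjacent sentences that share a word, a very rough
# stand-in for entity-grid-style local coherence.
def toy_score(sents):
    return sum(len(set(a.split()) & set(b.split()))
               for a, b in zip(sents, sents[1:]))

doc = ["John bought a car .", "The car was red .", "He drove it home ."]
print(shuffle_test([doc], toy_score))
```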

Multi-label Annotation in Scientific Articles - The Multi-label Cancer Risk Assessment Corpus

James Ravenscroft, Anika Oellrich, Shyamasree Saha and Maria Liakata

With the constant growth of the scientific literature, automated

processes to enable access to its contents are increasingly

in demand. Several functional discourse annotation schemes

have been proposed to facilitate information extraction and

summarisation from scientific articles, the most well known being

argumentative zoning. Core Scientific concepts (CoreSC) is a

three layered fine-grained annotation scheme providing content-

based annotations at the sentence level and has been used

to index, extract and summarise scientific publications in the

biomedical literature. A previously developed CoreSC corpus

on which existing automated tools have been trained contains

a single annotation for each sentence. However, it is the case

that more than one CoreSC concept can appear in the same

sentence. Here, we present the Multi-CoreSC CRA corpus, a

text corpus specific to the domain of cancer risk assessment

(CRA), consisting of 50 full text papers, each of which contains

sentences annotated with one or more CoreSCs. The full text

papers have been annotated by three biology experts. We

present several inter-annotator agreement measures appropriate


for multi-label annotation assessment. Employing several inter-

annotator agreement measures, we were able to identify the most

reliable annotator and we built a harmonised consensus (gold

standard) from the three different annotators, while also taking

concept priority (as specified in the guidelines) into account. We

also show that the new Multi-CoreSC CRA corpus allows us

to improve performance in the recognition of CoreSCs. The

updated guidelines, the multi-label CoreSC CRA corpus and other

relevant, related materials are available at the time of publication

at http://www.sapientaproject.com/.

Detecting Expressions of Blame or Praise in Text

Udochukwu Orizu and Yulan He

The growth of social networking platforms has drawn a lot of

attention to the need for social computing. Social computing

utilises human insights for computational tasks as well as design

of systems that support social behaviours and interactions. One

of the key aspects of social computing is the ability to attribute

responsibility such as blame or praise to social events. This

ability helps an intelligent entity account and understand other

intelligent entities’ social behaviours, and enriches both the social

functionalities and cognitive aspects of intelligent agents. In

this paper, we present an approach with a model for blame and

praise detection in text. We build our model based on various

theories of blame and include in our model features used by

humans when determining judgment, such as moral agent causality,

foreknowledge, intentionality and coercion. An annotated corpus

has been created for the task of blame and praise detection from

text. The experimental results show that while our model gives

similar results compared to supervised classifiers on classifying

text as blame, praise or others, it outperforms supervised

classifiers on the finer-grained classification of determining the

direction of blame and praise, i.e., self-blame, blame-others, self-

praise or praise-others, despite not using labelled training data.

Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource

Stephan Tulkens, Chris Emmery and Walter Daelemans

Word embeddings have recently seen a strong increase in interest

as a result of strong performance gains on a variety of tasks.

However, most of this research also underlined the importance of

benchmark datasets, and the difficulty of constructing these for

a variety of language-specific tasks. Still, many of the datasets

used in these tasks could prove to be fruitful linguistic resources,

allowing for unique observations into language use and variability.

In this paper we demonstrate the performance of multiple types

of embeddings, created with both count and prediction-based

architectures on a variety of corpora, in two language-specific

tasks: relation evaluation, and dialect identification. For the

latter, we compare unsupervised methods with a traditional,

hand-crafted dictionary. With this research, we provide the

embeddings themselves, the relation evaluation task benchmark

for use in further research, and demonstrate how the benchmarked

embeddings prove a useful unsupervised linguistic resource,

effectively used in a downstream task.

SatiricLR: a Language Resource of Satirical News Articles

Alice Frain and Sander Wubben

In this paper we introduce the Satirical Language Resource:

a dataset containing a balanced collection of satirical and non

satirical news texts from various domains. This is the first dataset

of this magnitude and scope in the domain of satire. We envision

this dataset will facilitate studies on various aspects of satire

in news articles. We test the viability of our data on the task of

classification of satire.

P51 - Multilingual Corpora, Poster Session
Friday, May 27, 11:45

Chairperson: Penny Labropoulou

WIKIPARQ: A Tabulated Wikipedia Resource Using the Parquet Format

Marcus Klang and Pierre Nugues

Wikipedia has become one of the most popular resources in

natural language processing and it is used in quantities of

applications. However, Wikipedia requires a substantial pre-

processing step before it can be used. For instance, its set of

nonstandardized annotations, referred to as the wiki markup, is

language-dependent and needs specific parsers from language

to language, for English, French, Italian, etc. In addition, the

intricacies of the different Wikipedia resources (main article text, categories, wikidata, infoboxes), scattered within the article document or across different files, make it difficult to have a global view

of this outstanding resource. In this paper, we describe WikiParq,

a unified format based on the Parquet standard to tabulate and

package the Wikipedia corpora. In combination with Spark, a

map-reduce computing framework, and the SQL query language,

WikiParq makes it much easier to write database queries to extract

specific information or subcorpora from Wikipedia, such as all

the first paragraphs of the articles in French, or all the articles

on persons in Spanish, or all the articles on persons that have

versions in French, English, and Spanish. WikiParq is available

in six language versions and is potentially extendible to all the

languages of Wikipedia. The WikiParq files are downloadable as tarball archives from this location: http://semantica.cs.lth.se/wikiparq/.
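In the spirit of the Spark and SQL usage described above, a query for the first paragraphs of French articles might look as follows; the file path and column names are assumptions for illustration, since the actual WikiParq schema should be taken from its documentation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wikiparq-demo").getOrCreate()

# Hypothetical path and column names (language, paragraph_index, text);
# consult the WikiParq documentation for the actual schema.
articles = spark.read.parquet("frwiki.parquet")
articles.createOrReplaceTempView("wikiparq")

# Example in the spirit of the query mentioned above: first paragraphs of
# the articles in French.
first_paragraphs = spark.sql("""
    SELECT text
    FROM wikiparq
    WHERE language = 'fr' AND paragraph_index = 0
""")
first_paragraphs.show(5, truncate=80)
```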

EN-ES-CS: An English-Spanish Code-Switching Twitter Corpus for Multilingual Sentiment Analysis

David Vilares, Miguel A. Alonso and Carlos Gómez-Rodríguez

Code-switching texts are those that contain terms in two or

more different languages, and they appear increasingly often in

social media. The aim of this paper is to provide a resource

to the research community to evaluate the performance of

sentiment classification techniques on this complex multilingual

environment, proposing an English-Spanish corpus of tweets with

code-switching (EN-ES-CS CORPUS). The tweets are labeled

according to two well-known criteria used for this purpose:

SentiStrength and a trinary scale (positive, neutral and negative

categories). Preliminary work on the resource is already done,

providing a set of baselines for the research community.

SemRelData – Multilingual Contextual Annotation of Semantic Relations between Nominals: Dataset and Guidelines

Darina Benikova and Chris Biemann

Semantic relations play an important role in linguistic knowledge

representation. Although their role is relevant in the context of

written text, there is no approach or dataset that makes use of

contextuality of classic semantic relations beyond the boundary

of one sentence. We present the SemRelData dataset that contains

annotations of semantic relations between nominals in the context

of one paragraph. To be able to analyse the universality of this

context notion, the annotation was performed on a multi-lingual

and multi-genre corpus. To evaluate the dataset, it is compared

to large, manually created knowledge resources in the respective

languages. The comparison shows that knowledge bases not

only have coverage gaps; they also do not account for semantic

relations that are manifested in particular contexts only, yet still

play an important role for text cohesion.

A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity Detection

Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab

In this paper we describe our effort to create a dataset for

the evaluation of cross-language textual similarity detection.

We present preexisting corpora and their limits and we

explain the various gathered resources to overcome these limits

and build our enriched dataset. The proposed dataset is

multilingual, includes cross-language alignment for different

granularities (from chunk to document), is based on both

parallel and comparable corpora and contains human and

machine translated texts. Moreover, it includes texts written

by multiple types of authors (from average to professionals).

With the obtained dataset, we conduct a systematic and

rigorous evaluation of several state-of-the-art cross-language

textual similarity detection methods. The evaluation results

are reviewed and discussed. Finally, dataset and scripts are

made publicly available on GitHub: http://github.com/FerreroJeremy/Cross-Language-Dataset.

An Arabic-Moroccan Darija Code-Switched Corpus

Younes Samih and Wolfgang Maier

In this paper, we describe our effort in the development and

annotation of a large scale corpus containing code-switched data.

Until recently, very limited effort has been devoted to developing

computational approaches or even basic linguistic resources to

support research into the processing of Moroccan Darija.

Standard Test Collection for English-Persian Cross-Lingual Word Sense Disambiguation

Navid Rekabsaz, Serwah Sabetghadam, Mihai Lupu, Linda Andersson and Allan Hanbury

In this paper, we address the shortage of evaluation benchmarks

on Persian (Farsi) language by creating and making available a

new benchmark for English to Persian Cross Lingual Word Sense

Disambiguation (CL-WSD). In creating the benchmark, we follow

the format of the SemEval 2013 CL-WSD task, such that the

introduced tools of the task can also be applied on the benchmark.

In fact, the new benchmark extends the SemEval-2013 CL-WSD

task to Persian language.

FREME: Multilingual Semantic Enrichment with Linked Data and Language Technologies

Milan Dojchinovski, Felix Sasaki, Tatjana Gornostaja, Sebastian Hellmann, Erik Mannens, Frank Salliau, Michele Osella, Phil Ritchie, Giannis Stoitsis, Kevin Koidl, Markus Ackermann and Nilesh Chakraborty

In the recent years, Linked Data and Language Technology

solutions gained popularity. Nevertheless, their coupling in real-

world business is limited due to several issues. Existing products

and services are developed for a particular domain, can be used

only in combination with already integrated datasets or their

language coverage is limited. In this paper, we present an

innovative solution FREME - an open framework of e-Services

for multilingual and semantic enrichment of digital content. The

framework integrates six interoperable e-Services. We describe

the core features of each e-Service and illustrate their usage in


the context of four business cases: i) authoring and publishing; ii)

translation and localisation; iii) cross-lingual access to data; and

iv) personalised Web content recommendations. Business cases

drive the design and development of the framework.

Improving Bilingual Terminology Extraction from Comparable Corpora via Multiple Word-Space Models

Amir Hazem and Emmanuel Morin

There is a rich flora of word space models that have proven their

efficiency in many different applications including information

retrieval (Dumais, 1988), word sense disambiguation (Schutze,

1992), various semantic knowledge tests (lund, 1995; Karlgren,

2001), and text categorization (Sahlgren, 2005). Based on

the assumption that each model captures some aspects of word

meanings and provides its own empirical evidence, we present

in this paper a systematic exploration of the principal corpus-

based word space models for bilingual terminology extraction

from comparable corpora. We find that, once we have identified

the best procedures, a very simple combination approach leads to

significant improvements compared to individual models.

MultiVec: a Multilingual and Multilevel Representation Learning Toolkit for NLP

Alexandre Berard, Christophe Servan, Olivier Pietquin and Laurent Besacier

We present MultiVec, a new toolkit for computing continuous

representations for text at different granularity levels (word-level

or sequences of words). MultiVec includes word2vec’s features,

paragraph vector (batch and online) and bivec for bilingual

distributed representations. MultiVec also includes different

distance measures between words and sequences of words. The

toolkit is written in C++ and is aimed at being fast (in the

same order of magnitude as word2vec), easy to use, and easy to

extend. It has been evaluated on several NLP tasks: the analogical

reasoning task, sentiment analysis, and crosslingual document

classification.

Creation of comparable corpora for English-Urdu, Arabic, Persian

Murad Abouammoh, Kashif Shah and Ahmet Aker

Statistical Machine Translation (SMT) relies on the availability of

rich parallel corpora. However, in the case of under-resourced

languages or some specific domains, parallel corpora are not

readily available. This leads to under-performing machine

translation systems in those sparse data settings. To overcome

the low availability of parallel resources the machine translation

community has recognized the potential of using comparable

resources as training data. However, most efforts have been

related to European languages and less to Middle Eastern languages. In this study, we report comparable corpora created from news articles for the pairs English-Arabic, English-Persian and English-Urdu. The data has been collected over a period of a year and covers Arabic, Persian and Urdu. Furthermore, using English as a pivot language, comparable corpora that involve more than one language can be created, e.g. English-Arabic-Persian, English-Arabic-Urdu, English-Urdu-Persian, etc. Upon request the

data can be provided for research purposes.

A Corpus of Native, Non-native and Translated Texts

Sergiu Nisioi, Ella Rabinovich, Liviu P. Dinu and Shuly Wintner

We describe a monolingual English corpus of original and

(human) translated texts, with an accurate annotation of speaker

properties, including the original language of the utterances and

the speaker’s country of origin. We thus obtain three sub-

corpora of texts reflecting native English, non-native English,

and English translated from a variety of European languages.

This dataset will facilitate the investigation of similarities and

differences between these kinds of sub-languages. Moreover,

it will facilitate a unified comparative study of translations and

language produced by (highly fluent) non-native speakers, two

closely-related phenomena that have only been studied in isolation

so far.

Orthographic and Morphological Correspondences between Related Slavic Languages as a Base for Modeling of Mutual Intelligibility

Andrea Fischer, Klara Jagrova, Irina Stenger, Tania Avgustinova, Dietrich Klakow and Roland Marti

In an intercomprehension scenario, typically a native speaker

of language L1 is confronted with output from an unknown,

but related language L2. In this setting, the degree to which

the receiver recognizes the unfamiliar words greatly determines

communicative success. Despite exhibiting great string-level

differences, cognates may be recognized very successfully if

the receiver is aware of regular correspondences which allow to

transform the unknown word into its familiar form. Modeling

L1-L2 intercomprehension then requires the identification of all

the regular correspondences between languages L1 and L2. We

here present a set of linguistic orthographic correspondences

manually compiled from comparative linguistics literature along

with a set of statistically-inferred suggestions for correspondence

rules. In order to do statistical inference, we followed the


Minimum Description Length principle, which proposes to

choose those rules which are most effective at describing the

data. Our statistical model was able to reproduce most of our

linguistic correspondences (88.5% for Czech-Polish and 75.7%

for Bulgarian-Russian) and furthermore allowed to easily identify

many more non-trivial correspondences which also cover aspects

of morphology.

Axolotl: a Web Accessible Parallel Corpus for Spanish-Nahuatl

Ximena Gutierrez-Vasques, Gerardo Sierra and Isaac Hernandez Pompa

This paper describes the project called Axolotl which comprises a

Spanish-Nahuatl parallel corpus and its search interface. Spanish

and Nahuatl are distant languages spoken in the same country.

Due to the scarcity of digital resources, we describe several

problems that arose when compiling this corpus: most of our

sources were non-digital books, we faced errors when digitizing

the sources and there were difficulties in the sentence alignment

process, just to mention some. The documents of the parallel

corpus are not homogeneous: they were extracted from different sources, and there is dialectal, diachronic, and orthographical

variation. Additionally, we present a web search interface that

allows users to query the whole parallel corpus; the system is capable of retrieving the parallel fragments that contain

a word or phrase searched by a user in any of the languages. To

our knowledge, this is the first publicly available Spanish-Nahuatl

digital parallel corpus. We think that this resource can be useful

to develop language technologies and linguistic studies for this

language pair.

A Turkish-German Code-Switching Corpus

Özlem Çetinoglu

Bilingual communities often alternate between languages both

in spoken and written communication. One such community,

Germany residents of Turkish origin produce Turkish-German

code-switching, by heavily mixing two languages at discourse,

sentence, or word level. Code-switching in general, and Turkish-

German code-switching in particular, has been studied for a long

time from a linguistic perspective. Yet resources to study them

from a more computational perspective are limited due to either

small size or licence issues. In this work we contribute towards a solution to this problem with a corpus. We present a Turkish-German code-

switching corpus which consists of 1029 tweets, with a majority

of intra-sentential switches. We share the different types of code-

switching we have observed in our collection and describe our

processing steps. The first step is data collection and filtering.

This is followed by manual tokenisation and normalisation. And

finally, we annotate data with word-level language identification

information. The resulting corpus is available for research

purposes.

Introducing the LCC Metaphor Datasets

Michael Mohler, Mary Brunson, Bryan Rink and Marc Tomlinson

In this work, we present the Language Computer Corporation

(LCC) annotated metaphor datasets, which represent the largest

and most comprehensive resource for metaphor research to date.

These datasets were produced over the course of three years by

a staff of nine annotators working in four languages (English,

Spanish, Russian, and Farsi). As part of these datasets, we

provide (1) metaphoricity ratings for within-sentence word pairs

on a four-point scale, (2) scored links to our repository of 114

source concept domains and 32 target concept domains, and

(3) ratings for the affective polarity and intensity of each pair.

Altogether, we provide 188,741 annotations in English (for 80,100

pairs), 159,915 annotations in Spanish (for 63,188 pairs), 99,740

annotations in Russian (for 44,632 pairs), and 137,186 annotations

in Farsi (for 57,239 pairs). In addition, we are providing a large

set of likely metaphors which have been independently extracted

by our two state-of-the-art metaphor detection systems but which

have not been analyzed by our team of annotators.

Creating a Large Multi-Layered Representational Repository of Linguistic Code Switched Arabic Data

Mona Diab, Mahmoud Ghoneim, Abdelati Hawwari, Fahad AlGhamdi, Nada AlMarwani and Mohamed Al-Badrashiny

We present our effort to create a large Multi-Layered

representational repository of Linguistic Code-Switched Arabic

data. The process involves developing clear annotation

standards and Guidelines, streamlining the annotation process,

and implementing quality control measures. We used two main

protocols for annotation: in-lab gold annotations and crowd

sourcing annotations. We developed a web-based annotation tool

to facilitate the management of the annotation process. The

current version of the repository contains a total of 886,252 tokens

that are tagged into one of sixteen code-switching tags. The data

exhibits code switching between Modern Standard Arabic and

Egyptian Dialectal Arabic representing three data genres: Tweets,

commentaries, and discussion fora. The overall Inter-Annotator

Agreement is 93.1%.


Modelling a Parallel Corpus of French and French Belgian Sign Language

Laurence Meurant, Maxime Gobert and Anthony Cleve

The overarching objective underlying this research is to develop

an online tool, based on a parallel corpus of French Belgian

Sign Language (LSFB) and written Belgian French. This tool

is aimed to assist various set of tasks related to the comparison

of LSFB and French, to the benefit of general users as well as

teachers in bilingual schools, translators and interpreters, as well

as linguists. These tasks include (1) the comprehension of LSFB

or French texts, (2) the production of LSFB or French texts, (3)

the translation between LSFB and French in both directions and

(4) the contrastive analysis of these languages. The first step

of investigation aims at creating an unidirectional French-LSFB

concordancer, able to align a one- or multiple-word expression

from the French translated text with its corresponding expressions

in the videotaped LSFB productions. We aim at testing the

efficiency of this concordancer for the extraction of a dictionary of

meanings in context. In this paper, we will present the modelling

of the different data sources at our disposal and specifically the

way they interact with one another.

Building the Macedonian-Croatian Parallel Corpus

Ines Cebovic and Marko Tadic

In this paper we present the newly created parallel corpus of

two under-resourced languages, namely, Macedonian-Croatian

Parallel Corpus (mk-hr_pcorp) that has been collected during

2015 at the Faculty of Humanities and Social Sciences, University

of Zagreb. The mk-hr_pcorp is a unidirectional (mk→hr)

parallel corpus composed of synchronic fictional prose texts

received already in digital form with over 500 Kw in each

language. The corpus was sentence segmented and provides

39,735 aligned sentences. The alignment was done automatically

and then post-corrected manually. The alignments order was

shuffled and this enabled the corpus to be available under CC-

BY license through META-SHARE. However, this prevents the

research in language units over the sentence level.

Two Years of Aranea: Increasing Counts and Tuning the Pipeline

Vladimír Benko

The Aranea Project is targeted at creation of a family of Gigaword

web-corpora for a dozen of languages that could be used for

teaching language- and linguistics-related subjects at Slovak

universities, as well as for research purposes in various areas of

linguistics. All corpora are being built according to a standard

methodology and using the same set of tools for processing

and annotation, which – together with their standard size – makes them also a valuable resource for translators and contrastive

studies. All our corpora are freely available either via a web

interface or in a source form in an annotated vertical format.

Quantitative Analysis of Gazes and Grounding Acts in L1 and L2 Conversations

Ichiro Umata, Koki Ijuin, Mitsuru Ishida, Moe Takeuchi and Seiichi Yamamoto

The listener’s gazing activities during utterances were analyzed

in a face-to-face three-party conversation setting. The function

of each utterance was categorized according to the Grounding

Acts defined by Traum (Traum, 1994) so that gazes during

utterances could be analyzed from the viewpoint of grounding

in communication (Clark, 1996). Quantitative analysis showed

that the listeners were gazing at the speakers more in the second

language (L2) conversation than in the native language (L1)

conversation during the utterances that added new pieces of

information, suggesting that they are using visual information

to compensate for their lack of linguistic proficiency in L2

conversation.

Multi-language Speech Collection for NIST LRE

Karen Jones, Stephanie Strassel, Kevin Walker, David Graff and Jonathan Wright

The Multi-language Speech (MLS) Corpus supports NIST’s

Language Recognition Evaluation series by providing new

conversational telephone speech and broadcast narrowband data

in 20 languages/dialects. The corpus was built with the intention

of testing system performance in the matter of distinguishing

closely related or confusable linguistic varieties, and careful

manual auditing of collected data was an important aspect of

this work. This paper lists the specific data requirements for

the collection and provides both a commentary on the rationale

for those requirements as well as an outline of the various steps

taken to ensure all goals were met as specified. LDC conducted

a large-scale recruitment effort involving the implementation of

candidate assessment and interview techniques suitable for hiring

a large contingent of telecommuting workers, and this recruitment

effort is discussed in detail. We also describe the telephone

and broadcast collection infrastructure and protocols, and provide

details of the steps taken to pre-process collected data prior to

auditing. Finally, annotation training, procedures and outcomes

are presented in detail.


P52 - Part of Speech Tagging (2), Poster Session
Friday, May 27, 11:45

Chairperson: Piotr Banski

FlexTag: A Highly Flexible PoS Tagging Framework

Torsten Zesch and Tobias Horsmann

We present FlexTag, a highly flexible PoS tagging framework. In

contrast to monolithic implementations that can only be retrained

but not adapted otherwise, FlexTag enables users to modify the

feature space and the classification algorithm. Thus, FlexTag

makes it easy to quickly develop custom-made taggers exactly

fitting the research problem.

New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian

Nikola Ljubešic, Filip Klubicka, Željko Agic and Ivo-Pavao Jazbec

In this paper we present newly developed inflectional lexicons

and manually annotated corpora of Croatian and Serbian. We

introduce hrLex and srLex - two freely available inflectional

lexicons of Croatian and Serbian - and describe the process of

building these lexicons, supported by supervised machine learning

techniques for lemma and paradigm prediction. Furthermore,

we introduce hr500k, a manually annotated corpus of Croatian,

500 thousand tokens in size. We showcase the three newly

developed resources on the task of morphosyntactic annotation of

both languages by using a recently developed CRF tagger. We

achieve best results yet reported on the task for both languages,

beating the HunPos baseline trained on the same datasets by a

wide margin.

TGermaCorp – A (Digital) Humanities Resource for (Computational) Linguistics

Andy Luecking, Armin Hoenen and Alexander Mehler

TGermaCorp is a German text corpus whose primary sources

are collected from German literature texts which date from

the sixteenth century to the present. The corpus is intended

to represent its target language (German) in syntactic, lexical,

stylistic and chronological diversity. For this purpose, it is hand-

annotated on several linguistic layers, including POS, lemma,

named entities, multiword expressions, clauses, sentences and

paragraphs. In order to introduce TGermaCorp in comparison to

more homogeneous corpora of contemporary everyday language,

quantitative assessments of syntactic and lexical diversity are

provided. In this respect, TGermaCorp contributes to establishing

characterising features for resource descriptions, which is needed

for keeping track of a meaningful comparison of the ever-growing

number of natural language resources. The assessments confirm

the special role of proper names, whose propagation in text may

influence lexical and syntactic diversity measures in rather trivial

ways. TGermaCorp will be made available via hucompute.org.

The hunvec framework for NN-CRF-based sequential tagging

Katalin Pajkossy and Attila Zséder

In this work we present the open source hunvec framework

for sequential tagging, built upon Theano and Pylearn2. The

underlying statistical model, which connects linear CRF-s with

neural networks, was used by Collobert and co-workers, and

several other researchers. For demonstrating the flexibility of

our tool, we describe a set of experiments on part-of-speech

and named-entity-recognition tasks, using English and Hungarian

datasets, where we modify both model and training parameters,

and illustrate the usage of custom features. Model parameters

we experiment with affect the vectorial word representations used

by the model; we apply different word vector initializations,

defined by Word2vec and GloVe embeddings, and enrich the representation of words with vectors assigned to trigram features.

We extend training methods by using their regularized (l2 and

dropout) version. When testing our framework on a Hungarian

named entity corpus, we find that its performance reaches the

best published results on this dataset, with no need for language-

specific feature engineering. Our code is available at http://github.com/zseder/hunvec.

A Large Scale Corpus of Gulf Arabic

Salam Khalifa, Nizar Habash, Dana Abdulrahim and Sara Hassan

Most Arabic natural language processing tools and resources

are developed to serve Modern Standard Arabic (MSA), which

is the official written language in the Arab World. Some

Dialectal Arabic varieties, notably Egyptian Arabic, have received

some attention lately and have a growing collection of resources

that include annotated corpora and morphological analyzers and

taggers. Gulf Arabic, however, lags behind in that respect. In

this paper, we present the Gumar Corpus, a large-scale corpus of

Gulf Arabic consisting of 110 million words from 1,200 forum

novels. We annotate the corpus for sub-dialect information at the

document level. We also present results of a preliminary study

in the morphological annotation of Gulf Arabic which includes

developing guidelines for a conventional orthography. The text


of the corpus is publicly browsable through a web interface we

developed for it.

UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing

Milan Straka, Jan Hajic and Jana Straková

Automatic natural language processing of large texts often

presents recurring challenges in multiple languages: even for

most advanced tasks, the texts are first processed by basic

processing steps – from tokenization to parsing. We present

an extremely simple-to-use tool consisting of one binary and

one model (per language), which performs these tasks for

multiple languages without the need for any other external

data. UDPipe, a pipeline processing CoNLL-U-formatted files,

performs tokenization, morphological analysis, part-of-speech

tagging, lemmatization and dependency parsing for nearly all

treebanks of Universal Dependencies 1.2 (namely, the whole

pipeline is currently available for 32 out of 37 treebanks). In

addition, the pipeline is easily trainable with training data in

CoNLL-U format (and in some cases also with additional raw

corpora) and requires minimal linguistic knowledge on the users’

part. The training code is also released.
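UDPipe reads and writes CoNLL-U, whose ten tab-separated columns can be consumed with a few lines of standard Python; the sketch below parses the pipeline's output format rather than calling UDPipe itself.

```python
# The ten CoNLL-U columns produced by the pipeline.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def read_conllu(lines):
    """Yield sentences as lists of token dicts from CoNLL-U lines,
    skipping comments as well as multi-word-token and empty-node rows."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            if cols[0].isdigit():          # skip rows like "3-4" and "5.1"
                sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence

sample = [
    "# text = Hello world",
    "1\tHello\thello\tINTJ\tUH\t_\t2\tdiscourse\t_\t_",
    "2\tworld\tworld\tNOUN\tNN\t_\t0\troot\t_\t_",
    "",
]
for sent in read_conllu(sample):
    print([(tok["form"], tok["upos"], tok["head"]) for tok in sent])
```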

Exploiting Arabic Diacritization for High Quality Automatic Annotation

Nizar Habash, Anas Shahrour and Muhamed Al-Khalil

We present a novel technique for Arabic morphological

annotation. The technique utilizes diacritization to produce

morphological annotations of quality comparable to human

annotators. Although Arabic text is generally written without

diacritics, diacritization is already available for large corpora

of Arabic text in several genres. Furthermore, diacritization

can be generated at a low cost for new text as it does not

require specialized training beyond what educated Arabic typists

know. The basic approach is to enrich the input to a state-

of-the-art Arabic morphological analyzer with word diacritics

(full or partial) to enhance its performance. When applied to

fully diacritized text, our approach produces annotations with an

accuracy of over 97% on lemma, part-of-speech, and tokenization

combined.

A Proposal for a Part-of-Speech Tagset for the Albanian Language

Besim Kabashi and Thomas Proisl

Part-of-speech tagging is a basic step in Natural Language

Processing that is often essential. Labeling the word forms of

a text with fine-grained word-class information adds new value

to it and can be a prerequisite for downstream processes like

a dependency parser. Corpus linguists and lexicographers also

benefit greatly from the improved search options that are available

with tagged data. The Albanian language has some properties that

pose difficulties for the creation of a part-of-speech tagset. In this

paper, we discuss those difficulties and present a proposal for a

part-of-speech tagset that can adequately represent the underlying

linguistic phenomena.

Using a Small Lexicon with CRFs Confidence Measure to Improve POS Tagging Accuracy

Mohamed Outahajala and Paolo Rosso

Like most languages that have only recently started being investigated for Natural Language Processing (NLP) tasks, Amazigh still suffers from a scarcity of annotated corpora, linguistic tools and resources. The main aim of

this paper is to present a new part-of-speech (POS) tagger based

on a new Amazigh tag set (AMTS) composed of 28 tags. In line

with our goal we have trained Conditional Random Fields (CRFs)

to build a POS tagger for the Amazigh language. We have used

the 10-fold technique to evaluate and validate our approach. The

average accuracy of the CRF model over the 10 folds is 87.95%, and the best fold

reaches 91.18%. In order to improve this result, we have gathered

a set of about 8k words with their POS tags. The collected

lexicon was used together with the CRF confidence measure to build

a more accurate POS-tagger. Hence, we have obtained a better

performance of 93.82%.
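
A minimal sketch of how such a lexicon can be combined with per-token CRF confidences is given below; the threshold, the input format and the lexicon entries are illustrative assumptions, not the authors' exact setup.

    # Hedged sketch: overriding low-confidence CRF predictions with a small lexicon.
    # The threshold, lexicon entries and tag names are illustrative assumptions.
    def combine_with_lexicon(tokens, crf_tags, crf_confidences, lexicon, threshold=0.8):
        """Return tags where low-confidence CRF decisions are replaced by lexicon tags."""
        combined = []
        for token, tag, confidence in zip(tokens, crf_tags, crf_confidences):
            if confidence < threshold and token in lexicon:
                combined.append(lexicon[token])   # trust the hand-built lexicon
            else:
                combined.append(tag)              # keep the CRF decision
        return combined

    # Toy usage with made-up tokens, tags and confidences:
    print(combine_with_lexicon(["azul", "fellawen"], ["NOUN", "VERB"], [0.95, 0.55],
                               {"fellawen": "PREP"}))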

Learning from Within? Comparing PoS Tagging Approaches for Historical Text

Sarah Schulz and Jonas Kuhn

In this paper, we investigate unsupervised and semi-supervised

methods for part-of-speech (PoS) tagging in the context of

historical German text. We locate our research in the context

of Digital Humanities where the non-canonical nature of text

causes issues for a Natural Language Processing world in

which tools are mainly trained on standard data. Data deviating

from the norm requires tools adjusted to this data. We explore

to what extent the availability of such training material and

resources related to it influences the accuracy of PoS tagging.

We investigate a variety of algorithms including neural nets,

conditional random fields and self-learning techniques in order

to find the best-fitted approach to tackle data sparsity. Although

methods using resources from related languages outperform

weakly supervised methods using just a few training examples,


we can still reach a promising accuracy with methods that abstain

from additional resources.

O45 - Lexicons: Wordnet and Framenet

Friday, May 27, 14:55

Chairperson: Dan Tufis Oral Session

Wow! What a Useful Extension! Introducing Non-Referential Concepts to Wordnet

Luís Morgado da Costa and Francis Bond

In this paper we present the ongoing efforts to expand the

depth and breadth of the Open Multilingual Wordnet coverage

by introducing two new classes of non-referential concepts to

wordnet hierarchies: interjections and numeral classifiers. The

lexical semantic hierarchy pioneered by Princeton Wordnet has

traditionally restricted its coverage to referential and contentful

classes of words, such as nouns, verbs, adjectives and adverbs.

Previous efforts have been made to enrich wordnet resources

including, for example, the inclusion of pronouns, determiners

and quantifiers within their hierarchies. Following similar efforts,

and motivated by the ongoing semantic annotation of the NTU-

Multilingual Corpus, we decided that the four traditional classes

of words present in wordnets were too restrictive. Though

non-referential, interjections and classifiers possess interesting

semantic features that can be well captured by lexical resources

like wordnets. In this paper, we will further motivate our

decision to include non-referential concepts in wordnets and give

an account of the current state of this expansion.

SlangNet: A WordNet like resource for English Slang

Shehzaad Dhuliawala, Diptesh Kanojia and Pushpak Bhattacharyya

We present a WordNet like structured resource for slang words and

neologisms on the internet. The dynamism of language is often

an indication that current language technology tools trained on

today’s data may not be able to process the language in the future.

Our resource could be (1) used to augment WordNet, or (2) used

in several Natural Language Processing (NLP) applications which

make use of noisy data on the internet like Information Retrieval

and Web Mining. Such a resource can also be used to distinguish

slang word senses from conventional word senses. To stimulate

similar innovations widely in the NLP community, we test the

efficacy of our resource for detecting slang using standard bag of

words Word Sense Disambiguation (WSD) algorithms (Lesk and

Extended Lesk) for English data on the internet.

Discovering Fuzzy Synsets from the Redundancy in Different Lexical-Semantic Resources

Hugo Gonçalo Oliveira and Fábio Santos

Although represented as such in wordnets, word senses are

not discrete. To handle word senses as fuzzy objects, we

exploit the graph structure of synonymy pairs acquired from

different sources to discover synsets where words have different

membership degrees that reflect confidence. Following this

approach, a wide-coverage fuzzy thesaurus was discovered from

a synonymy network compiled from seven Portuguese lexical-

semantic resources. Based on a crowdsourcing evaluation, we

can say that the quality of the obtained synsets is far from perfect

but, as expected in a confidence measure, it increases significantly

for higher cut-points on the membership and, at a certain point,

reaches a correctness rate of 100%.
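
One plausible reading of the membership degrees described above is the share of input resources that link a word to the rest of a candidate synset; the sketch below follows that assumption and is not necessarily the authors' exact formula.

    # Hedged sketch: fuzzy membership as the share of resources connecting a word
    # to the other members of a candidate synset (an illustrative formula only).
    def fuzzy_memberships(pairs_by_resource, candidate_synset):
        """pairs_by_resource: {resource_name: {(w1, w2), ...}} with unordered synonym pairs."""
        n_resources = len(pairs_by_resource)
        memberships = {}
        for word in candidate_synset:
            others = candidate_synset - {word}
            supporting = sum(
                1 for pairs in pairs_by_resource.values()
                if any((word, o) in pairs or (o, word) in pairs for o in others)
            )
            memberships[word] = supporting / n_resources
        return memberships

    resources = {"res1": {("carro", "automóvel")},
                 "res2": {("carro", "automóvel"), ("carro", "viatura")}}
    print(fuzzy_memberships(resources, {"carro", "automóvel", "viatura"}))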

The Hebrew FrameNet Project

Avi Hayoun and Michael Elhadad

We present the Hebrew FrameNet project, describe the

development and annotation processes and enumerate the

challenges we faced along the way. We have developed semi-

automatic tools to help speed the annotation and data collection

process. The resource currently covers 167 frames, 3,000 lexical

units and about 500 fully annotated sentences. We have started

training and testing automatic SRL tools on the seed data.

O46 - Digital Humanities

Friday, May 27, 14:55

Chairperson: Andreas Witt Oral Session

An Open Corpus for Named Entity Recognition in Historic Newspapers

Clemens Neudecker

The availability of open textual datasets (“corpora”)

with highly accurate manual annotations (“gold standard”) of

named entities (e.g. persons, locations, organizations, etc.) is

crucial in the training and evaluation of named entity recognition

systems. Currently there are only a few such datasets available

on the web, and even fewer for texts containing historical spelling

variation. The production and subsequent release into the public

domain of four such datasets with 100 pages each for the

languages Dutch, French, German (including Austrian) as part

of the Europeana Newspapers project is expected to contribute

to the further development and improvement of named entity


recognition systems with a focus on historical content. This paper

describes how these datasets were produced, what challenges were

encountered in their creation, and reports on their final quality

and availability.

Ambiguity Diagnosis for Terms in Digital Humanities

Béatrice Daille, Evelyne Jacquey, Gaël Lejeune, Luis Felipe Melo and Yannick Toussaint

Among all research dedicated to terminology and word sense

disambiguation, little attention has been devoted to the ambiguity

of term occurrences. If a lexical unit is indeed a term of

the domain, it is not true, even in a specialised corpus, that

all its occurrences are terminological. Some occurrences are

terminological and other are not. Thus, a global decision at the

corpus level about the terminological status of all occurrences of a

lexical unit would then be erroneous. In this paper, we propose

three original methods to characterise the ambiguity of term

occurrences in the domain of social sciences for French. These

methods differently model the context of the term occurrences:

one is relying on text mining, the second is based on textometry,

and the last one focuses on text genre properties. The experimental

results show the potential of the proposed approaches and give an

opportunity to discuss their hybridisation.

Metrical Annotation of a Large Corpus of Spanish Sonnets: Representation, Scansion and Evaluation

Borja Navarro, María Ribes-Lafoz and Noelia Sánchez

In order to analyze metrical and semantic aspects of poetry in

Spanish with computational techniques, we have developed a

large corpus annotated with metrical information. In this paper

we will present and discuss the development of this corpus: the

formal representation of metrical patterns, the semi-automatic

annotation process based on a new automatic scansion system, the

main annotation problems, and the evaluation, in which an inter-

annotator agreement of 96% has been obtained. The corpus is

open and available.

Corpus Analysis based on Structural Phenomena in Texts: Exploiting TEI Encoding for Linguistic Research

Susanne Haaf

This paper poses the question of how linguistic corpus-based

research may be enriched by the exploitation of conceptual text

structures and layout as provided via TEI annotation. Examples

for possible areas of research and usage scenarios are provided

based on the German historical corpus of the Deutsches Textarchiv

(DTA) project, which has been consistently tagged according

to the TEI Guidelines, more specifically to the DTA ›Base

Format‹ (DTABf). The paper shows that by including TEI-XML

structuring in corpus-based analyses, significant patterns can be observed

for different linguistic phenomena, e.g. the development of

conceptual text structures themselves, the syntactic embedding

of terms in certain conceptual text structures, and phenomena of

language change which become obvious via the layout of a text.

The exemplary study carried out here shows some of the potential

for the exploitation of TEI annotation for linguistic research,

which might be kept in mind when making design decisions for

new corpora as well as when working with existing TEI corpora.

O47 - Text Mining and Information Extraction

Friday, May 27, 14:55

Chairperson: Gregory Grefenstette Oral Session

Evaluating Entity Linking: An Analysis of Current Benchmark Datasets and a Roadmap for Doing a Better Job

Marieke van Erp, Pablo Mendes, Heiko Paulheim, Filip Ilievski, Julien Plu, Giuseppe Rizzo and Joerg Waitelonis

Entity linking has become a popular task in both natural language

processing and semantic web communities. However, we find that

the benchmark datasets for entity linking tasks do not accurately

evaluate entity linking systems. In this paper, we aim to chart

the strengths and weaknesses of current benchmark datasets and

sketch a roadmap for the community to devise better benchmark

datasets.

Studying the Temporal Dynamics of Word Co-occurrences: An Application to Event Detection

Daniel Preotiuc-Pietro, P. K. Srijith, Mark Hepple and Trevor Cohn

Streaming media provides a number of unique challenges for

computational linguistics. This paper studies the temporal

variation in word co-occurrence statistics, with application to

event detection. We develop a spectral clustering approach to find

groups of mutually informative terms occurring in discrete time

frames. Experiments on large datasets of tweets show that these

groups identify key real world events as they occur in time, despite

no explicit supervision. The performance of our method rivals


state-of-the-art methods for event detection on F-score, obtaining

higher recall at the expense of precision.
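
A minimal sketch of the clustering step is shown below: terms are grouped by spectral clustering over a per-time-frame co-occurrence affinity matrix; the normalisation and the parameters are illustrative assumptions, not the authors' exact setup.

    # Hedged sketch: spectral clustering of terms from one time frame's symmetric
    # co-occurrence counts. Normalisation and parameters are illustrative only.
    import numpy as np
    from sklearn.cluster import SpectralClustering

    def cluster_terms(cooccurrence, terms, n_clusters=2):
        affinity = cooccurrence / (cooccurrence.max() or 1.0)  # scale counts to [0, 1]
        labels = SpectralClustering(
            n_clusters=n_clusters, affinity="precomputed", random_state=0
        ).fit_predict(affinity)
        groups = {}
        for term, label in zip(terms, labels):
            groups.setdefault(int(label), []).append(term)
        return groups

    terms = ["goal", "match", "referee", "vote", "election", "ballot"]
    counts = np.array([[0, 5, 4, 0, 0, 0],
                       [5, 0, 3, 0, 1, 0],
                       [4, 3, 0, 0, 0, 0],
                       [0, 0, 0, 0, 6, 5],
                       [0, 1, 0, 6, 0, 4],
                       [0, 0, 0, 5, 4, 0]], dtype=float)
    print(cluster_terms(counts, terms))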

Markov Logic Networks for Text Mining: A Qualitative and Empirical Comparison with Integer Linear Programming

Luis Gerardo Mojica de la Vega and Vincent Ng

Joint inference approaches such as Integer Linear Programming

(ILP) and Markov Logic Networks (MLNs) have recently been

successfully applied to many natural language processing (NLP)

tasks, often outperforming their pipeline counterparts. However,

MLNs are arguably much less popular among NLP researchers

than ILP. While NLP researchers who desire to employ these joint

inference frameworks do not necessarily have to understand their

theoretical underpinnings, it is imperative that they understand

which of them should be applied under what circumstances.

With the goal of helping NLP researchers better understand the

relative strengths and weaknesses of MLNs and ILP, we will

compare them along different dimensions of interest, such as

expressiveness, ease of use, scalability, and performance. To our

knowledge, this is the first systematic comparison of ILP and

MLNs on an NLP task.

Arabic Corpora for Credibility Analysis

Ayman Al Zaatari, Rim El Ballouli, Shady ELbassouni, Wassim El-Hajj, Hazem Hajj, Khaled Shaban, Nizar Habash and Emad Yahya

A significant portion of data generated on blogging and

microblogging websites is non-credible as shown in many recent

studies. To filter out such non-credible information, machine

learning can be deployed to build automatic credibility classifiers.

However, as is the case with most supervised machine learning

approaches, sufficiently large and accurate training data must

be available. In this paper, we focus on building a public Arabic

corpus of blogs and microblogs that can be used for credibility

classification. We focus on Arabic due to the recent popularity of

blogs and microblogs in the Arab World and due to the lack of any

such public corpora in Arabic. We discuss our data acquisition

approach and annotation process, provide a rigorous analysis of the

annotated data and finally report some results on the effectiveness

of our data for credibility classification.

O48 - Corpus Creation and Analysis

Friday, May 27, 14:55

Chairperson: Paul Rayson Oral Session

Solving the AL Chicken-and-Egg Corpus and Model Problem: Model-free Active Learning for Phenomena-driven Corpus Construction

Dain Kaplan, Neil Rubens, Simone Teufel and Takenobu Tokunaga

Active learning (AL) is often used in corpus construction (CC)

for selecting “informative” documents for annotation. This is

ideal for focusing annotation efforts when all documents cannot

be annotated, but has the limitation that it is carried out in a

closed-loop, selecting points that will improve an existing model.

For phenomena-driven and exploratory CC, the lack of existing

models and of specific task(s) for using them makes traditional AL

inapplicable. In this paper we propose a novel method for model-

free AL utilising characteristics of phenomena for applying AL to

select documents for annotation. The method can also supplement

traditional closed-loop AL-based CC to extend the utility of the

corpus created beyond a single task. We introduce our tool,

MOVE, and show its potential with a real world case-study.

QUEMDISSE? Reported speech in Portuguese

Cláudia Freitas, Bianca Freitas and Diana Santos

This paper presents some work on direct and indirect speech in

Portuguese using corpus-based methods: we report on a study

whose aim was to identify (i) Portuguese verbs used to introduce

reported speech and (ii) syntactic patterns used to convey reported

speech, in order to enhance the performance of a quotation

extraction system, dubbed QUEMDISSE?. In addition, (iii) we

present a Portuguese corpus annotated with reported speech, using

the lexicon and rules provided by (i) and (ii), and discuss the

process of their annotation and what was learned.

MEANTIME, the NewsReader Multilingual Event and Time Corpus

Anne-Lyse Minard, Manuela Speranza, Ruben Urizar, Begoña Altuna, Marieke van Erp, Anneleen Schoen and Chantal van Son

In this paper, we present the NewsReader MEANTIME corpus, a

semantically annotated corpus of Wikinews articles. The corpus

consists of 480 news articles, i.e. 120 English news articles and

their translations in Spanish, Italian, and Dutch. MEANTIME

contains annotations at different levels. The document-level

annotation includes markables (e.g. entity mentions, event

mentions, time expressions, and numerical expressions), relations


between markables (modeling, for example, temporal information

and semantic role labeling), and entity and event intra-document

coreference. The corpus-level annotation includes entity and

event cross-document coreference. Semantic annotation on the

English section was performed manually; for the annotation in

Italian, Spanish, and (partially) Dutch, a procedure was devised to

automatically project the annotations on the English texts onto the

translated texts, based on the manual alignment of the annotated

elements; this enabled us not only to speed up the annotation

process but also to obtain cross-lingual coreference. The English

section of the corpus was extended with timeline annotations for

the SemEval 2015 TimeLine shared task. The “First CLIN Dutch

Shared Task” at CLIN26 was based on the Dutch section, while

the EVALITA 2016 FactA (Event Factuality Annotation) shared

task, based on the Italian section, is currently being organized.

The ACQDIV Database: Min(d)ing the Ambient Language

Steven Moran

One of the most pressing questions in cognitive science remains

unanswered: what cognitive mechanisms enable children to learn

any of the world’s 7000 or so languages? Much discovery has

been made with regard to specific learning mechanisms in specific

languages. However, given the remarkable diversity of language

structures (Evans and Levinson 2009, Bickel 2014) the burning

question remains: what are the underlying processes that make

language acquisition possible, despite substantial cross-linguistic

variation in phonology, morphology, syntax, etc.? To investigate

these questions, a comprehensive cross-linguistic database of

longitudinal child language acquisition corpora from maximally

diverse languages has been built.

P53 - Dialogue (2)

Friday, May 27, 14:55

Chairperson: Thorsten Trippel Poster Session

Summarizing Behaviours: An Experiment on the Annotation of Call-Centre Conversations

Morena Danieli, Balamurali A R, Evgeny Stepanov, Benoit Favre, Frederic Bechet and Giuseppe Riccardi

Annotating and predicting behavioural aspects in conversations is

becoming critical in the conversational analytics industry. In this

paper we look into inter-annotator agreement of agent behaviour

dimensions on two call center corpora. We find that the task can

be annotated consistently over time, but that subjectivity issues

impact the quality of the annotation. The reformulation of some

of the annotated dimensions is suggested in order to improve

agreement.

Survey of Conversational Behavior: Towards the Design of a Balanced Corpus of Everyday Japanese Conversation

Hanae Koiso, Tomoyuki Tsuchiya, Ryoko Watanabe, Daisuke Yokomori, Masao Aizawa and Yasuharu Den

In 2016, we set about building a large-scale corpus of everyday

Japanese conversation – a collection of conversations embedded

in naturally occurring activities in daily life. We will collect

more than 200 hours of recordings over six years, publishing

the corpus in 2022. To construct such a huge corpus, we

have conducted a pilot project, one of whose purposes is to

establish a corpus design for collecting various kinds of everyday

conversations in a balanced manner. For this purpose, we

conducted a survey of everyday conversational behavior, with

about 250 adults, in order to reveal how diverse our everyday

conversational behavior is and to build an empirical foundation

for corpus design. The questionnaire included when, where, how

long, with whom, and in what kind of activity informants were

engaged in conversations. We found that ordinary conversations

show the following tendencies: i) they mainly consist of chats,

business talks, and consultations; ii) in general, the number of

participants is small and the duration of the conversation is short;

iii) many conversations are conducted in private places such as

homes, as well as in public places such as offices and schools; and

iv) some questionnaire items are related to each other. This paper

describes an overview of this survey study, and then discusses how

to design a large-scale corpus of everyday Japanese conversation

on this basis.

A Multi-party Multi-modal Dataset for Focus of Visual Attention in Human-human and Human-robot Interaction

Kalin Stefanov and Jonas Beskow

This paper describes a data collection setup and a newly recorded

dataset. The main purpose of this dataset is to explore patterns

in the focus of visual attention of humans under three different

conditions - two humans involved in task-based interaction with a

robot; the same two humans involved in task-based interaction where

the robot is replaced by a third human, and a free three-party

human interaction. The dataset contains two parts - 6 sessions with

duration of approximately 3 hours and 9 sessions with duration

of approximately 4.5 hours. Both parts of the dataset are rich in

modalities and recorded data streams - they include the streams

of three Kinect v2 devices (color, depth, infrared, body and face

data), three high quality audio streams, three high resolution


GoPro video streams, touch data for the task-based interactions

and the system state of the robot. In addition, the second part

of the dataset introduces the data streams from three Tobii Pro

Glasses 2 eye trackers. The language of all interactions is English

and all data streams are spatially and temporally aligned.

Internet Argument Corpus 2.0: An SQL schema for Dialogic Social Media and the Corpora to go with it

Rob Abbott, Brian Ecker, Pranav Anand and Marilyn Walker

Large scale corpora have benefited many areas of research in

natural language processing, but until recently, resources for

dialogue have lagged behind. Now, with the emergence of large

scale social media websites incorporating a threaded dialogue

structure, content feedback, and self-annotation (such as stance

labeling), there are valuable new corpora available to researchers.

In previous work, we released the INTERNET ARGUMENT

CORPUS, one of the first larger scale resources available for

opinion sharing dialogue. We now release the INTERNET

ARGUMENT CORPUS 2.0 (IAC 2.0) in the hope that others will

find it as useful as we have. The IAC 2.0 provides more data

than IAC 1.0 and organizes it using an extensible, repurposable

SQL schema. The database structure in conjunction with the

associated code facilitates querying from and combining multiple

dialogically structured data sources. The IAC 2.0 schema provides

support for forum posts, quotations, markup (bold, italic, etc), and

various annotations, including Stanford CoreNLP annotations.

We demonstrate the generalizability of the schema by providing

code to import the ConVote corpus.

Capturing Chat: Annotation and Tools for Multiparty Casual Conversation.

Emer Gilmartin and Nick Campbell

Casual multiparty conversation is an understudied but very

common genre of spoken interaction, whose analysis presents a

number of challenges in terms of data scarcity and annotation.

We describe the annotation process used on the d64 and DANS

multimodal corpora of multiparty casual talk, which have been

manually segmented, transcribed, annotated for laughter and

disfluencies, and aligned using the Penn Aligner. We also describe

a visualization tool, STAVE, developed during the annotation

process, which allows long stretches of talk or indeed entire

conversations to be viewed, aiding preliminary identification of

features and patterns worthy of analysis. It is hoped that this tool

will be of use to other researchers working in this field.

P54 - LR Infrastructures and Architectures (2)

Friday, May 27, 14:55

Chairperson: Koiti Hasida Poster Session

Privacy Issues in Online Machine Translation Services - European Perspective

Pawel Kamocki and Jim O’Regan

In order to develop its full potential, global communication

needs linguistic support systems such as Machine Translation

(MT). In the past decade, free online MT tools have become

available to the general public, and the quality of their output is

increasing. However, the use of such tools may entail various legal

implications, especially as far as processing of personal data is

concerned. This is even more evident if we take into account that

their business model is largely based on providing translation in

exchange for data, which can subsequently be used to improve the

translation model, but also for commercial purposes. The purpose

of this paper is to examine how free online MT tools fit in the

European data protection framework, harmonised by the EU Data

Protection Directive. The perspectives of both the user and the

MT service provider are taken into account.

Lin|gu|is|tik: Building the Linguist’s Pathway to Bibliographies, Libraries, Language Resources and Linked Open Data

Christian Chiarcos, Christian Fäth, Heike Renner-Westermann, Frank Abromeit and Vanya Dimitrova

This paper introduces a novel research tool for the field of

linguistics: The Lin|gu|is|tik web portal provides a virtual

library which offers scientific information on every linguistic

subject. It comprises selected internet sources and databases

as well as catalogues for linguistic literature, and addresses

an interdisciplinary audience. The virtual library is the most

recent outcome of the Special Subject Collection Linguistics

of the German Research Foundation (DFG), and also integrates

the knowledge accumulated in the Bibliography of Linguistic

Literature. In addition to the portal, we describe long-term goals

and prospects with a special focus on ongoing efforts regarding an

extension towards integrating language resources and Linguistic

Linked Open Data.

Towards a Language Service Infrastructure for Mobile Environments

Ngoc Nguyen, Donghui Lin, Takao Nakaguchi and Toru Ishida

Since mobile devices have feature-rich configurations and provide

diverse functions, the use of mobile devices combined with the

language resources of cloud environments is highly promising

for achieving wide-ranging communication that goes beyond

the current language barrier. However, there are mismatches

154

between using resources of mobile devices and services in the

cloud, such as different communication protocols and different

input and output methods. In this paper, we propose a language

service infrastructure for mobile environments to combine these

services. The proposed language service infrastructure allows

users to use and mashup existing language resources on both cloud

environments and their mobile devices. Furthermore, it allows

users to flexibly use services in the cloud or services on mobile

devices in their composite service without implementing several

different composite services that have the same functionality. A

case study of Mobile Shopping Translation System using both a

service in the cloud (translation service) and services on mobile

devices (Bluetooth low energy (BLE) service and text-to-speech

service) is introduced.

Designing A Long Lasting Linguistic Project: The Case Study of ASIt

Maristella Agosti, Emanuele Di Buccio, Giorgio Maria Di Nunzio, Cecilia Poletto and Esther Rinke

In this paper, we discuss the requirements that a long lasting

linguistic database should have in order to meet the needs of the

linguists together with the aim of durability and sharing of data.

In particular, we discuss the generalizability of the Syntactic Atlas

of Italy, a linguistic project that builds on a long-standing tradition

of collecting and analyzing linguistic corpora, to a more recent

project that focuses on the synchronic and diachronic analysis

of the syntax of Italian and Portuguese relative clauses. The

results that are presented are in line with the FLaReNet Strategic

Agenda that highlighted the most pressing needs for research

areas, such as Natural Language Processing, and presented a set of

recommendations for the development and progress of Language

resources in Europe.

Global Open Resources and Information for Language and Linguistic Analysis (GORILLA)

Damir Cavar, Malgorzata Cavar and Lwin Moe

The infrastructure Global Open Resources and Information for

Language and Linguistic Analysis (GORILLA) was created as

a resource that provides a bridge between disciplines such as

documentary, theoretical, and corpus linguistics, speech and

language technologies, and digital language archiving services.

GORILLA is designed as an interface between digital language

archive services and language data producers. It addresses various

problems of common digital language archive infrastructures.

At the same time it serves the speech and language technology

communities by providing a platform to create and share speech

and language data from low-resourced and endangered languages.

It hosts an initial collection of language models for speech and

natural language processing (NLP), and technologies or software

tools for corpus creation and annotation. GORILLA is designed to

address the Transcription Bottleneck in language documentation,

and, at the same time to provide solutions to the general Language

Resource Bottleneck in speech and language technologies. It

does so by facilitating the cooperation between documentary

and theoretical linguistics, and speech and language technologies

research and development, in particular for low-resourced and

endangered languages.

corpus-tools.org: An Interoperable Generic Software Tool Set for Multi-layer Linguistic Corpora

Stephan Druskat, Volker Gast, Thomas Krause and Florian Zipser

This paper introduces an open source, interoperable generic

software tool set catering for the entire workflow of creation,

migration, annotation, query and analysis of multi-layer linguistic

corpora. It consists of four components: Salt, a graph-based meta

model and API for linguistic data, the common data model for

the rest of the tool set; Pepper, a conversion tool and platform for

linguistic data that can be used to convert many different linguistic

formats into each other; Atomic, an extensible, platform-

independent multi-layer desktop annotation software for linguistic

corpora; ANNIS, a search and visualization architecture for multi-

layer linguistic corpora with many different visualizations and a

powerful native query language. The set was designed to solve the

following issues in a multi-layer corpus workflow: Lossless data

transition between tools through a common data model generic

enough to allow for a potentially unlimited number of different

types of annotation, conversion capabilities for different linguistic

formats to cater for the processing of data from different sources

and/or with existing annotations, a high level of extensibility

to enhance the sustainability of the whole tool set, analysis

capabilities encompassing corpus and annotation query alongside

multi-faceted visualizations of all annotation layers.

CommonCOW: Massively Huge Web Corpora from CommonCrawl Data and a Method to Distribute them Freely under Restrictive EU Copyright Laws

Roland Schäfer

In this paper, I describe a method of creating massively huge

web corpora from the CommonCrawl data sets and redistributing

the resulting annotations in a stand-off format. Current EU (and

especially German) copyright legislation categorically forbids

the redistribution of downloaded material without express prior

permission by the authors. Therefore, such stand-off annotations

(or other derivates) are the only format in which European


researchers (like myself) are allowed to re-distribute the respective

corpora. In order to make the full corpora available to the

public despite such restrictions, the stand-off format presented

here allows anybody to locally reconstruct the full corpora with

the least possible computational effort.
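
As an illustration of how stand-off data of this kind can be resolved locally, the sketch below fetches one gzipped WARC record from CommonCrawl via an HTTP range request; the stand-off record fields used here are hypothetical and do not reflect the actual CommonCOW format.

    # Hedged sketch: reconstructing one document from a stand-off record pointing
    # into CommonCrawl. The record fields are hypothetical illustrations only.
    import gzip
    import requests

    CC_PREFIX = "https://data.commoncrawl.org/"

    def fetch_warc_record(warc_path, offset, length):
        """Fetch a single gzipped WARC record by HTTP range request and return its text."""
        headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
        response = requests.get(CC_PREFIX + warc_path, headers=headers, timeout=60)
        response.raise_for_status()
        return gzip.decompress(response.content).decode("utf-8", errors="replace")

    def reconstruct(record):
        """record: a stand-off entry with assumed 'warc_path', 'offset' and 'length' fields."""
        return fetch_warc_record(record["warc_path"], record["offset"], record["length"])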

CLARIN-EL Web-based Annotation Tool

Ioannis Manousos Katakis, Georgios Petasis and Vangelis Karkaletsis

This paper presents a new Web-based annotation tool, the

“CLARIN-EL Web-based Annotation Tool”. Based on an existing

annotation infrastructure offered by the “Ellogon” language

engineering platform, this new tool transfers a large part of

Ellogon’s features and functionalities to a Web environment,

by exploiting the capabilities of cloud computing. This new

annotation tool is able to support a wide range of annotation

tasks, through user provided annotation schemas in XML. The

new annotation tool has already been employed in several

annotation tasks, including the annotation of arguments, which

is presented as a use case. The CLARIN-EL annotation tool

is compared to existing solutions along several dimensions and

features. Finally, future work includes the improvement of

integration with the CLARIN-EL infrastructure, and the inclusion

of features not currently supported, such as the annotation of

aligned documents.

Two Architectures for Parallel Processing of Huge Amounts of Text

Mathijs Kattenberg, Zuhaitz Beloki, Aitor Soroa, Xabier Artola, Antske Fokkens, Paul Huygen and Kees Verstoep

This paper presents two alternative NLP architectures to analyze

massive amounts of documents, using parallel processing. The

two architectures focus on different processing scenarios, namely

batch-processing and streaming processing. The batch-processing

scenario aims at optimizing the overall throughput of the

system, i.e., minimizing the overall time spent on processing all

documents. The streaming architecture aims to minimize the

time to process real-time incoming documents and is therefore

especially suitable for live feeds. The paper presents experiments

with both architectures, and reports the overall gain when they

are used for batch as well as for streaming processing. All the

software described in the paper is publicly available under free

licenses.

Publishing the Trove Newspaper Corpus

Steve Cassidy

The Trove Newspaper Corpus is derived from the National Library

of Australia’s digital archive of newspaper text. The corpus is a

snapshot of the NLA collection taken in 2015 to be made available

for language research as part of the Alveo Virtual Laboratory and

contains 143 million articles dating from 1806 to 2007. This

paper describes the work we have done to make this large corpus

available as a research collection, facilitating access to individual

documents and enabling large scale processing of the newspaper

text in a cloud-based environment.

New Developments in the LRE Map

Vladimir Popescu, Lin Liu, Riccardo Del Gratta, Khalid Choukri and Nicoletta Calzolari

In this paper we describe the new developments brought to

LRE Map, especially in terms of the user interface of the Web

application, of the searching of the information therein, and of

the data model updates. Thus, users now have several new search

facilities, such as faceted search and fuzzy textual search, they

can now register, log in and store search bookmarks for further

perusal. Moreover, the data model now includes the notion of

paper and author, which allows for linking the resources to the

scientific works. Also, users can now visualise author-provided

field values and normalised values. The normalisation has been

manual and enables a better grouping of the entries. Last but not

least, provisions have been made towards linked open data (LOD)

aspects, by exposing an RDF access point that allows querying the

authors, papers and resources. Finally, a complete technological

overhaul of the whole application has been undertaken, especially

in terms of the Web infrastructure and of the text search backend.

P55 - Large Projects and Infrastructures (2)

Friday, May 27, 14:55

Chairperson: Dieter Van Uytvanck Poster Session

Hidden Resources – Strategies to Acquire and Exploit Potential Spoken Language Resources in National Archives

Jens Edlund and Joakim Gustafson

In 2014, the Swedish government tasked a Swedish agency, The

Swedish Post and Telecom Authority (PTS), with investigating

how to best create and populate an infrastructure for spoken

language resources (Ref N2014/2840/ITP). As a part of this work,

the department of Speech, Music and Hearing at KTH Royal

Institute of Technology has taken an inventory of existing potential

spoken language resources, mainly in Swedish national archives

and other governmental or public institutions. In this position

paper, key priorities, perspectives, and strategies that may be of

general, rather than Swedish, interest are presented. We discuss

broad types of potential spoken language resources available; to

what extent these resources are free to use; and thirdly the main

contribution: strategies to ensure the continuous acquisition of


spoken language resources in a manner that facilitates speech and

speech technology research.

The ELRA License Wizard

Valérie Mapelli, Vladimir Popescu, Lin Liu, Meritxell Fernández Barrera and Khalid Choukri

To allow an easy understanding of the various licenses that exist

for the use of Language Resources (ELRA’s, META-SHARE’s,

Creative Commons’, etc.), ELRA has developed a License

Wizard to help the right-holders share/distribute their resources

under the appropriate license. It also aims to be exploited by

users to better understand the legal obligations that apply in

various licensing situations. The present paper elaborates on the

License Wizard functionalities of this web configurator, which

enables users to select a number of legal features and obtain the user

license adapted to the user’s selection, to define which user licenses

they would like to select in order to distribute their Language

Resources, to integrate the user license terms into a Distribution

Agreement that could be proposed to ELRA or META-SHARE

for further distribution through the ELRA Catalogue of Language

Resources. Thanks to a flexible back office, the structure of the

legal feature selection can easily be reviewed to include other

features that may be relevant for other licenses. Integrating

contributions from other initiatives thus aim to be one of the

obvious next steps, with a special focus on CLARIN and Linked

Data experiences.

Review on the Existing Language Resources for Languages of France

Thibault Grouas, Valérie Mapelli and Quentin Samier

With the support of the DGLFLF, ELDA conducted an inventory

of existing language resources for the regional languages of

France. The main aim of this inventory was to assess the

exploitability of the identified resources within technologies.

A total of 2,299 Language Resources were identified. As a

second step, a deeper analysis of a set of three language groups

(Breton, Occitan, overseas languages) was carried out along with

a focus of their exploitability within three technologies: automatic

translation, voice recognition/synthesis and spell checkers. The

survey was followed by the organisation of the TLRF2015

Conference which aimed to present the state of the art in the field

of the Technologies for Regional Languages of France. The next

step will be to activate the network of specialists built up during

the TLRF conference and to begin the organisation of a second

TLRF conference. Meanwhile, the French Ministry of Culture

continues its actions related to linguistic diversity and technology,

in particular through a project with Wikimedia France related to

contributions to Wikipedia in regional languages, the upcoming

new version of the “Corpus de la Parole” and the reinforcement of

the DGLFLF’s Observatory of Linguistic Practices.

Selection Criteria for Low Resource Language Programs

Christopher Cieri, Mike Maxwell, Stephanie Strassel and Jennifer Tracey

This paper documents and describes the criteria used to select

languages for study within programs that include low resource

languages, whether given that label or another similar one. It

focuses on five US common task, Human Language Technology

research and development programs in which the authors have

provided information or consulting related to the choice of

language. The paper does not describe the actual selection process

which is the responsibility of program management and highly

specific to a program’s individual goals and context. Instead it

concentrates on the data and criteria that have been considered

relevant previously with the thought that future program managers

and their consultants may adapt these and apply them with

different prioritization to future programs.

Enhancing Cross-border EU E-commerce through Machine Translation: Needed Language Resources, Challenges and Opportunities

Meritxell Fernández Barrera, Vladimir Popescu, Antonio Toral, Federico Gaspari and Khalid Choukri

This paper discusses the role that statistical machine translation

(SMT) can play in the development of cross-border EU

e-commerce, by highlighting extant obstacles and identifying

relevant technologies to overcome them. In this sense, it firstly

proposes a typology of e-commerce static and dynamic textual

genres and it identifies those that may be more successfully

targeted by SMT. The specific challenges concerning the

automatic translation of user-generated content are discussed

in detail. Secondly, the paper highlights the risk of data

sparsity inherent to e-commerce and it explores the state-of-

the-art strategies to achieve domain adequacy via adaptation.

Thirdly, it proposes a robust workflow for the development of

SMT systems adapted to the e-commerce domain by relying

on inexpensive methods. Given the scarcity of user-generated

language corpora for most language pairs, the paper proposes to

obtain monolingual target-language data to train language models

and aligned parallel corpora to tune and evaluate MT systems by

means of crowdsourcing.


P56 - Semantics (2)

Friday, May 27, 14:55

Chairperson: Yoshihiko Hayashi Poster Session

Nine Features in a Random Forest to Learn Taxonomical Semantic Relations

Enrico Santus, Alessandro Lenci, Tin-Shing Chiu, Qin Lu and Chu-Ren Huang

ROOT9 is a supervised system for the classification of hypernyms,

co-hyponyms and random words that is derived from the already

introduced ROOT13 (Santus et al., 2016). It relies on a Random

Forest algorithm and nine unsupervised corpus-based features.

We evaluate it with a 10-fold cross validation on 9,600 pairs,

equally distributed among the three classes and involving several

Parts-Of-Speech (i.e. adjectives, nouns and verbs). When all the

classes are present, ROOT9 achieves an F1 score of 90.7%, against

a baseline of 57.2% (vector cosine). When the classification is

binary, ROOT9 achieves the following results against the baseline:

hypernyms-co-hyponyms 95.7% vs. 69.8%, hypernyms-random

91.8% vs. 64.1% and co-hyponyms-random 97.8% vs. 79.4%.

In order to compare the performance with the state-of-the-art,

we have also evaluated ROOT9 in subsets of the Weeds et al.

(2014) datasets, proving that it is in fact competitive. Finally, we

investigated whether the system learns the semantic relation or it

simply learns the prototypical hypernyms, as claimed by Levy et

al. (2015). The second possibility seems to be the most likely,

even though ROOT9 can be trained on negative examples (i.e.,

switched hypernyms) to drastically reduce this bias.
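
A minimal sketch of the learning setup is shown below: a Random Forest trained on nine corpus-derived features per word pair and scored with 10-fold cross validation; the feature extraction is omitted and the data is synthetic.

    # Hedged sketch: a Random Forest over nine features per word pair, evaluated
    # with 10-fold cross validation. The data here is synthetic; the real features
    # would be the unsupervised corpus-based measures described in the abstract.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((300, 9))                 # one row per word pair, nine features
    y = rng.integers(0, 3, size=300)         # 0 = hypernym, 1 = co-hyponym, 2 = random

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    print("mean accuracy:", cross_val_score(clf, X, y, cv=10).mean())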

What a Nerd! Beating Students and Vector Cosine in the ESL and TOEFL Datasets

Enrico Santus, Alessandro Lenci, Tin-Shing Chiu, Qin Lu and Chu-Ren Huang

In this paper, we claim that Vector Cosine – which is generally

considered one of the most efficient unsupervised measures

for identifying word similarity in Vector Space Models – can

be outperformed by a completely unsupervised measure that

evaluates the extent of the intersection among the most associated

contexts of two target words, weighting such intersection

according to the rank of the shared contexts in the dependency

ranked lists. This claim comes from the hypothesis that similar

words do not simply occur in similar contexts, but they share a

larger portion of their most relevant contexts compared to other

related words. To prove it, we describe and evaluate APSyn, a

variant of Average Precision that – independently of the adopted

parameters – outperforms the Vector Cosine and the co-occurrence

on the ESL and TOEFL test sets. In the best setting, APSyn

reaches 0.73 accuracy on the ESL dataset and 0.70 accuracy on

the TOEFL dataset, therefore beating the non-English US college

applicants (whose average, as reported in the literature, is 64.50%)

and several state-of-the-art approaches.
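
A sketch of one common formulation of such a measure is given below: the overlap of the top-N most associated contexts of two words, with each shared context weighted by the inverse of its average rank; the exact weighting used in the paper may differ.

    # Hedged sketch: an APSyn-style score over ranked context lists. The exact
    # weighting in the paper may differ from this formulation.
    def apsyn(contexts_w1, contexts_w2, n=100):
        """contexts_w*: contexts sorted by association strength, best first."""
        rank1 = {c: i + 1 for i, c in enumerate(contexts_w1[:n])}
        rank2 = {c: i + 1 for i, c in enumerate(contexts_w2[:n])}
        shared = set(rank1) & set(rank2)
        # Each shared context contributes the inverse of its average rank.
        return sum(1.0 / ((rank1[c] + rank2[c]) / 2.0) for c in shared)

    print(apsyn(["drive", "road", "wheel"], ["road", "drive", "engine"]))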

Assessing the Potential of Metaphoricity of verbs using corpus data

Marco Del Tredici and Nuria Bel

The paper investigates the relation between metaphoricity and

distributional characteristics of verbs, introducing POM, a corpus-

derived index that can be used to define the upper bound of

metaphoricity of any expression in which a given verb occurs.

The work starts from the observation that while some verbs

can be used to create highly metaphoric expressions, others

cannot. We conjecture that this fact is related to the number of

contexts in which a verb occurs and to the frequency of each

context. This intuition is modelled by introducing a method in

which each context of a verb in a corpus is assigned a vector

representation, and a clustering algorithm is employed to identify

similar contexts. Eventually, the Standard Deviation of the relative

frequency values of the clusters is computed and taken as the POM

of the target verb. We tested POM in two experimental settings

obtaining values of accuracy of 84% and 92%. Since we are

convinced, along with (Shutoff, 2015), that metaphor detection

systems should be concerned only with the identification of highly

metaphoric expressions, we believe that POM could be profitably

employed by these systems to a priori exclude expressions that,

due to the verb they include, can only have low degrees of

metaphoricity.
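
A sketch of the computation described above follows: the contexts of a verb are clustered and the standard deviation of the clusters' relative frequencies is taken as the POM; how contexts are vectorised and the number of clusters are assumptions here.

    # Hedged sketch: a POM-style index for one verb. Context vectorisation and the
    # number of clusters are illustrative assumptions.
    import numpy as np
    from sklearn.cluster import KMeans

    def pom(context_vectors, n_clusters=10):
        """context_vectors: one row per corpus context of the target verb."""
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(context_vectors)
        relative_freqs = np.bincount(labels, minlength=n_clusters) / len(labels)
        # A skewed distribution over context clusters yields a high standard deviation.
        return float(np.std(relative_freqs))

    rng = np.random.default_rng(0)
    print(pom(rng.random((200, 50))))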

Semantic Relation Extraction with Semantic Patterns: Experiment on Radiology Reports

Mathieu Lafourcade and Lionel Ramadier

This work presents a practical system for indexing terms and

relations from French radiology reports, called IMAIOS. In this

paper, we present how semantic relations (causes, consequences,

symptoms, locations, parts...) between medical terms can be

extracted. For this purpose, we handcrafted some linguistic

patterns from a subset of our radiology report corpora. As

semantic patterns (de (of)) may be too general or ambiguous,

semantic constraints have been added. For instance, in the

sentence néoplasie du sein (neoplasm of breast), the system,

knowing neoplasm to be a disease and breast an anatomical

location, identifies the relation as being a location: neoplasm r-


lieu breast. An evaluation of the effect of semantic constraints

is proposed.
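
A toy sketch of a pattern with semantic constraints, along the lines of the example above, is given below; the type inventory, the pattern and the relation label are simplified placeholders rather than the IMAIOS rules.

    # Hedged sketch: a French "X de Y" pattern whose interpretation depends on
    # semantic type constraints. Types, pattern and labels are simplified placeholders.
    import re

    SEMANTIC_TYPES = {"néoplasie": "disease", "sein": "anatomical_location"}

    def extract_relation(sentence):
        """Return (head, relation, dependent) if the constrained pattern matches."""
        match = re.search(r"(\w+)\s+(?:du|de la|de)\s+(\w+)", sentence)
        if not match:
            return None
        x, y = match.group(1), match.group(2)
        if SEMANTIC_TYPES.get(x) == "disease" and SEMANTIC_TYPES.get(y) == "anatomical_location":
            return (x, "r-lieu", y)   # a location relation, as in the example above
        return None

    print(extract_relation("néoplasie du sein"))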

EVALution-MAN: A Chinese Dataset for the Training and Evaluation of DSMs

Liu Hongchao, Karl Neergaard, Enrico Santus and Chu-Ren Huang

Distributional semantic models (DSMs) are currently being used

in the measurement of word relatedness and word similarity. One

shortcoming of DSMs is that they do not provide a principled way

to discriminate different semantic relations. Several approaches

have been adopted that rely on annotated data either in the

training of the model or later in its evaluation. In this paper, we

introduce a dataset for training and evaluating DSMs on semantic

relation discrimination between words in Mandarin Chinese.

The construction of the dataset followed EVALution 1.0, which

is an English dataset for the training and evaluation of DSMs.

The dataset contains 360 relation pairs, distributed in five different

semantic relations, including antonymy, synonymy, hypernymy,

meronymy and near-synonymy. All relation pairs were checked

manually to estimate their quality. In the 360 word relation pairs,

there are 373 relata. They were all extracted and subsequently

manually tagged according to their semantic type. The frequency of

the relata was calculated in a combined corpus of Sinica and

Chinese Gigaword. To the best of our knowledge, EVALution-

MAN is the first of its kind for Mandarin Chinese.

Towards Building Semantic Role Labeler for Indian Languages

Maaz Anwar and Dipti Sharma

We present a statistical system for identifying the semantic

relationships or semantic roles for two major Indian Languages,

Hindi and Urdu. Given an input sentence and a predicate/verb, the

system first identifies the arguments pertaining to that verb and

then classifies it into one of the semantic labels which can either

be a DOER, THEME, LOCATIVE, CAUSE, PURPOSE etc. The

system is based on 2 statistical classifiers trained on roughly

130,000 words for Urdu and 100,000 words for Hindi that were

hand-annotated with semantic roles under the PropBank project

for these two languages. Our system achieves an accuracy of

86% in identifying the arguments of a verb for Hindi and 75% for

Urdu. At the subsequent task of classifying the constituents into

their semantic roles, the Hindi system achieved 58% precision and

42% recall, whereas the Urdu system performed better and achieved

83% precision and 80% recall. Our study also allowed us to

compare the usefulness of different linguistic features and feature

combinations in the semantic role labeling task. We also examine

the use of statistical syntactic parsing as feature in the role labeling

task.

A Framework for Automatic Acquisition of Croatian and Serbian Verb Aspect from Corpora

Tanja Samardzic and Maja Milicevic

Verb aspect is a grammatical and lexical category that encodes

temporal unfolding and duration of events described by verbs.

It is a potentially interesting source of information for various

computational tasks, but has so far not been studied in much depth

from the perspective of automatic processing. Slavic languages

are particularly interesting in this respect, as they encode aspect

through complex and not entirely consistent lexical derivations

involving prefixation and suffixation. Focusing on Croatian and

Serbian, in this paper we propose a novel framework for automatic

classification of their verb types into a number of fine-grained

aspectual classes based on the observable morphology of verb

forms. In addition, we provide a set of around 2000 verbs

classified based on our framework. This set can be used for

linguistic research as well as for testing automatic classification

on a larger scale. With minor adjustments the approach is also

applicable to other Slavic languages.
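
A heavily simplified sketch of affix-based classification in the spirit of this framework is shown below; the affix lists and class names are illustrative and much coarser than the fine-grained classes proposed in the paper.

    # Hedged sketch: classifying Croatian/Serbian verbs by surface affixes.
    # The affix lists and class names are simplified illustrations only.
    PERFECTIVIZING_PREFIXES = ("na", "po", "za", "iz", "pre", "do")
    IMPERFECTIVIZING_SUFFIXES = ("avati", "ivati", "javati")

    def aspect_class(verb):
        if verb.endswith(IMPERFECTIVIZING_SUFFIXES):
            return "suffixed-imperfective"
        if verb.startswith(PERFECTIVIZING_PREFIXES):
            return "prefixed-perfective"
        return "base"

    for verb in ["pisati", "napisati", "zapisivati"]:
        print(verb, "->", aspect_class(verb))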

Monolingual Social Media Datasets for Detecting Contradiction and Entailment

Piroska Lendvai, Isabelle Augenstein, Kalina Bontcheva and Thierry Declerck

Entailment recognition approaches are useful for application

domains such as information extraction, question answering

or summarisation, for which evidence from multiple sentences

needs to be combined. We report on a new 3-way judgement

Recognizing Textual Entailment (RTE) resource that originates in

the Social Media domain, and explain our semi-automatic creation

method for the special purpose of information verification, which

draws on manually established rumourous claims reported during

crisis events. From about 500 English tweets related to 70 unique

claims we compile and evaluate 5.4k RTE pairs, while continuing

to automate the workflow to generate similar-sized datasets in

other languages.

VoxML: A Visualization Modeling Language

James Pustejovsky and Nikhil Krishnaswamy

We present the specification for a modeling language, VoxML,

which encodes semantic knowledge of real-world objects

represented as three-dimensional models, and of events and

attributes related to and enacted over these objects. VoxML is

intended to overcome the limitations of existing 3D visual markup

languages by allowing for the encoding of a broad range of

semantic knowledge that can be exploited by a variety of systems

and platforms, leading to multimodal simulations of real-world


scenarios using conceptual objects that represent their semantic

values.

Metonymy Analysis Using Associative Relations between Words

Takehiro Teraoka

Metonymy is a figure of speech in which one item’s name

represents another item that usually has a close relation with the

first one. Metonymic expressions need to be correctly detected

and interpreted because sentences including such expressions

have different meanings from literal ones; computer systems

may output inappropriate results in natural language processing.

In this paper, an associative approach for analyzing metonymic

expressions is proposed. By using associative information

and two conceptual distances between words in a sentence, a

previous method is enhanced and a decision tree is trained to

detect metonymic expressions. After detecting these expressions,

they are interpreted as metonymic understanding words by

using associative information. This method was evaluated by

comparing it with two baseline methods based on previous

studies on the Japanese language that used case frames and

co-occurrence information. As a result, the proposed method

exhibited significantly better accuracy (0.85) of determining

words as metonymic or literal expressions than the baselines. It

also exhibited better accuracy (0.74) of interpreting the detected

metonymic expressions than the baselines.
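
A minimal sketch of the detection step is given below: a decision tree trained on an associative-strength feature and two conceptual distances; the feature values and labels are synthetic placeholders.

    # Hedged sketch: a decision tree over associative strength and two conceptual
    # distances per expression. The values and labels here are synthetic placeholders.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.random((100, 3))            # columns: association, distance 1, distance 2
    y = rng.integers(0, 2, size=100)    # 1 = metonymic, 0 = literal (placeholder labels)

    clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
    print(clf.predict([[0.1, 0.8, 0.7]]))  # classify one unseen expression's features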

Embedding Open-domain Common-sense Knowledge from Text

Travis Goodwin and Sanda Harabagiu

Our ability to understand language often relies on common-sense

knowledge – background information the speaker can assume

is known by the reader. Similarly, our comprehension of the

language used in complex domains relies on access to domain-

specific knowledge. Capturing common-sense and domain-

specific knowledge can be achieved by taking advantage of recent

advances in open information extraction (IE) techniques and,

more importantly, of knowledge embeddings, which are multi-

dimensional representations of concepts and relations. Building

a knowledge graph for representing common-sense knowledge

in which concepts discerned from noun phrases are cast as

vertices and lexicalized relations are cast as edges leads to

learning the embeddings of common-sense knowledge accounting

for semantic compositionality as well as implied knowledge.

Common-sense knowledge is acquired from a vast collection of

blogs and books as well as from WordNet. Similarly, medical

knowledge is learned from two large sets of electronic health

records. The evaluation results of these two forms of knowledge

are promising: the same knowledge acquisition methodology

based on learning knowledge embeddings works well both

for common-sense knowledge and for medical knowledge.

Interestingly, the common-sense knowledge that we have acquired

was evaluated as being less neutral than the medical

knowledge, as it often reflected the opinion of the knowledge

utterer. In addition, the acquired medical knowledge was

evaluated as more plausible than the common-sense knowledge,

reflecting the complexity of acquiring common-sense knowledge

due to the pragmatics and economicity of language.

Medical Concept Embeddings via Labeled Background Corpora

Eneldo Loza Mencía, Gerard de Melo and Jinseok Nam

In recent years, we have seen an increasing amount of interest in

low-dimensional vector representations of words. Among other

things, these facilitate computing word similarity and relatedness

scores. The most well-known example of algorithms to produce

representations of this sort are the word2vec approaches. In

this paper, we investigate a new model to induce such vector

spaces for medical concepts, based on a joint objective that

exploits not only word co-occurrences but also manually labeled

documents, as available from sources such as PubMed. Our

extensive experimental analysis shows that our embeddings lead

to significantly higher correlations with human similarity and

relatedness assessments than previous work. Due to the simplicity

and versatility of vector representations, these findings suggest

that our resource can easily be used as a drop-in replacement

to improve any systems relying on medical concept similarity

measures.

Question-Answering with Logic Specific to Video Games

Corentin Dumont, Ran Tian and Kentaro Inui

We present a corpus and a knowledge database aiming at

developing Question-Answering in a new context, the open world

of a video game. We chose a popular game called ‘Minecraft’,

and created a QA corpus with a knowledge database related to

this game and the ontology of a meaning representation that

will be used to structure this database. We are interested in

the logic rules specific to the game, which may not exist in the

real world. The ultimate goal of this research is to build a QA

system that can answer natural language questions from players

by using inference on these game-specific logic rules. The QA

corpus is partially composed of online quiz questions and partially

composed of manually written variations of the most relevant

ones. The knowledge database is extracted from several wiki-

like websites about Minecraft. It is composed of unstructured

data, such as text, that will be structured using the meaning

representation we defined, and already structured data such as

160

infoboxes. A preliminary examination of the data shows that

players are asking creative questions about the game, and that the

QA corpus can be used for clustering verbs and linking them to

predefined actions in the game.
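As a minimal sketch of inference over game-specific logic rules, the following example forward-chains over a handful of invented triples and answers a triple-pattern question. The facts, rules and representation are illustrative assumptions and do not reflect the paper's ontology or knowledge database.

    # Invented game facts in a simple triple representation.
    facts = {
        ("iron_ore", "smelts_into", "iron_ingot"),
        ("iron_ingot", "crafts_into", "iron_pickaxe"),
        ("iron_pickaxe", "can_mine", "diamond_ore"),
    }

    def apply_rules(facts):
        """Naive forward chaining: if X smelts into Y and Y crafts into Z, X is needed for Z."""
        inferred = set(facts)
        changed = True
        while changed:
            changed = False
            new = set()
            for a, r1, b in inferred:
                for c, r2, d in inferred:
                    if r1 == "smelts_into" and r2 == "crafts_into" and b == c:
                        new.add((a, "needed_for", d))
            if not new <= inferred:
                inferred |= new
                changed = True
        return inferred

    def answer(question_triple, kb):
        """Answer a question expressed as a triple pattern with '?' marking the unknown slot."""
        h, r, t = question_triple
        return [f for f in kb if h in ("?", f[0]) and r == f[1] and t in ("?", f[2])]

    kb = apply_rules(facts)
    # "What do I need to craft an iron pickaxe?" as a triple pattern:
    print(answer(("?", "needed_for", "iron_pickaxe"), kb))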

P57 - Speech Corpora and Databases (2)
Friday, May 27, 14:55
Chairperson: Satoshi Nakamura
Poster Session

Mining the Spoken Wikipedia for Speech Data and Beyond
Arne Köhn, Florian Stegen and Timo Baumann

We present a corpus of time-aligned spoken data of Wikipedia articles as well as the pipeline that allows us to generate such corpora for many languages. There are initiatives to create and sustain spoken Wikipedia versions in many languages, and hence the data is freely available, grows over time, and can be used for automatic corpus creation. Our pipeline automatically downloads and aligns this data. The resulting German corpus currently totals 293h of audio, of which we align 71h in full sentences and another 86h of sentences with some missing words. The English corpus consists of 287h, for which we align 27h in full sentences and 157h with some missing words. Results are publicly available.
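Alignment statistics of the kind quoted above can be summarized from per-sentence alignment records along the following lines; the record format here is hypothetical and not the pipeline's actual output.

    # Sketch: hours aligned as full sentences vs. sentences with missing words.
    records = [
        # (article id, sentence duration in seconds, number of words the aligner could not place)
        ("Article_A", 6.2, 0),
        ("Article_A", 4.8, 1),
        ("Article_B", 5.5, 0),
        ("Article_B", 7.1, 3),
    ]

    full = sum(dur for _, dur, missing in records if missing == 0)
    partial = sum(dur for _, dur, missing in records if missing > 0)

    print(f"fully aligned:     {full / 3600:.4f} h")
    print(f"partially aligned: {partial / 3600:.4f} h")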

A Corpus of Read and Spontaneous Upper Saxon German Speech for ASR Evaluation
Robert Herms, Laura Seelig, Stefanie Münch and Maximilian Eibl

In this paper we present a corpus named SXUCorpus, which contains read and spontaneous speech of the Upper Saxon German dialect. The data has been collected from eight archives of local television stations located in the Free State of Saxony. The recordings include broadcast topics of news, economy, weather, sport, and documentation from the years 1992 to 1996, and have been manually transcribed and labeled. In the paper, we report the methodology of collecting and processing the analog audiovisual material and constructing the corpus, and describe the properties of the data. In its current version, the corpus is available to the scientific community and is designed for automatic speech recognition (ASR) evaluation with a development set and a test set. We performed ASR experiments on the dataset with the open-source framework Sphinx-4, including a configuration for Standard German. Additionally, we show the influence of acoustic model and language model adaptation using the development set.
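For reference, the word error rate used in such ASR evaluations can be computed with a standard Levenshtein alignment of word sequences, as in the generic sketch below; this is generic bookkeeping, not part of Sphinx-4 or the authors' tooling, and the example strings are invented.

    def wer(reference, hypothesis):
        """Word error rate = (substitutions + deletions + insertions) / reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = minimum number of edits to turn ref[:i] into hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # substitution, deletion, insertion
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("heute scheint die sonne in sachsen",
              "heute scheint de sonne sachsen"))  # 2 errors / 6 reference words = 0.333...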

Parallel Speech Corpora of Japanese Dialects
Koichiro Yoshino, Naoki Hirayama, Shinsuke Mori, Fumihiko Takahashi, Katsutoshi Itoyama and Hiroshi G. Okuno

Clean speech data is necessary for spoken language processing; however, there is no public Japanese dialect corpus collected for speech processing. Parallel speech corpora of dialects are also important because real dialects affect each other; however, the existing data only include noisy speech data of dialects and their translation into the common language. In this paper, we collected parallel speech corpora of Japanese dialects: 100 read utterances from 25 dialect speakers together with their phoneme transcriptions. We recorded speech from 5 common-language speakers and 20 dialect speakers from 4 areas, with 5 speakers per area. Each dialect speaker converted the same common-language texts into their dialect and read them. The speech was recorded with a close-talking microphone, for use in spoken language processing (recognition, synthesis, pronunciation estimation). In the experiments, the accuracies of an automatic speech recognition (ASR) system and a kana-kanji conversion (KKC) system are improved by adapting the systems with the data.

The TYPALOC Corpus: A Collection of Various Dysarthric Speech Recordings in Read and Spontaneous Styles
Christine Meunier, Cecile Fougeron, Corinne Fredouille, Brigitte Bigi, Lise Crevier-Buchman, Elisabeth Delais-Roussarie, Laurianne Georgeton, Alain Ghio, Imed Laaridh, Thierry Legou, Claire Pillot-Loiseau and Gilles Pouchoulin

This paper presents the TYPALOC corpus of French dysarthric and healthy speech and the rationale underlying its constitution. The objective is to compare phonetic variation in the speech of dysarthric vs. healthy speakers in different speech conditions (read and unprepared speech). More precisely, we aim to compare the extent, types and location of phonetic variation within these different populations and speech conditions. The TYPALOC corpus consists of a selection of 28 dysarthric patients (three different pathologies) and 12 healthy control speakers recorded while reading the same text and in a more natural continuous speech condition. Each audio signal has been segmented into Inter-Pausal Units. Then, the corpus has been manually transcribed and automatically aligned. The alignment has been corrected by an expert phonetician. Moreover, the corpus benefits from an automatic syllabification and an Automatic Detection of Acoustic Phone-Based Anomalies. Finally, in order to interpret phonetic variations due to pathologies, a perceptual evaluation of each patient has been conducted. Quantitative data are provided at the end of the paper.

A Longitudinal Bilingual Frisian-Dutch Radio Broadcast Database Designed for Code-Switching Research
Emre Yilmaz, Maaike Andringa, Sigrid Kingma, Jelske Dijkstra, Frits Van der Kuip, Hans Van de Velde, Frederik Kampstra, Jouke Algra, Henk van den Heuvel and David van Leeuwen

We present a new speech database containing 18.5 hours of annotated radio broadcasts in the Frisian language. Frisian is mostly spoken in the province Fryslân and it is the second official language of the Netherlands. The recordings are collected from the archives of Omrop Fryslân, the regional public broadcaster of the province Fryslân. The database covers almost a 50-year time span. The native speakers of Frisian are mostly bilingual and often code-switch in daily conversations due to the extensive influence of the Dutch language. Considering the longitudinal and code-switching nature of the data, an appropriate annotation protocol has been designed and the data is manually annotated with the orthographic transcription, speaker identities, dialect information, code-switching details and background noise/music information.
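One way to picture the annotation layers listed above is as a per-segment record; the sketch below uses a hypothetical field layout that does not reproduce the corpus's actual annotation protocol, and the example values (including the Frisian-like transcription) are invented.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Segment:
        start: float                     # segment start time in seconds
        end: float                       # segment end time in seconds
        speaker: str                     # speaker identity
        dialect: str                     # dialect label for the speaker
        language: str                    # main language of the segment, e.g. "fry" or "nld"
        transcription: str               # orthographic transcription
        code_switch_spans: List[str] = field(default_factory=list)  # embedded other-language spans
        background: str = "clean"        # background noise/music information

    seg = Segment(start=12.4, end=15.9, speaker="presenter_01", dialect="Klaaifrysk",
                  language="fry", transcription="hjoed yn it nijs ...",
                  code_switch_spans=["in it nijs"], background="music")
    print(seg.language, seg.code_switch_spans)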

The SI TEDx-UM speech database: a new Slovenian Spoken Language Resource
Andrej Zgank, Mirjam Sepesy Maucec and Darinka Verdonik

This paper presents a new Slovenian spoken language resource built from TEDx talks. The speech database contains 242 talks with a total duration of 54 hours. The annotation and transcription of the acquired spoken material were generated automatically, applying acoustic segmentation and automatic speech recognition. The development and evaluation subset was also manually transcribed using the guidelines specified for the Slovenian GOS corpus. The manual transcriptions were used to evaluate the quality of the unsupervised transcriptions. The average word error rate for the SI TEDx-UM evaluation subset was 50.7%, with an out-of-vocabulary rate of 24% and a language model perplexity of 390. The unsupervised transcriptions contain 372k tokens, of which 32k were distinct.
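Token, type and out-of-vocabulary figures of this kind follow from simple counting against the recognizer's vocabulary, as in the generic sketch below; the text and the vocabulary are invented for illustration.

    # Count tokens, types and the out-of-vocabulary (OOV) rate against an ASR vocabulary.
    asr_vocabulary = {"danes", "bomo", "govorili", "o", "novem", "korpusu"}
    transcript = "danes bomo govorili o novem slovenskem govornem korpusu".split()

    tokens = len(transcript)
    types = len(set(transcript))
    oov = sum(1 for w in transcript if w not in asr_vocabulary)

    print(f"tokens={tokens}, types={types}, OOV rate={oov / tokens:.1%}")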

Speech Corpus Spoken by Young-old, Old-old and Oldest-old Japanese
Yurie Iribe, Norihide Kitaoka and Shuhei Segawa

We have constructed a new speech data corpus, using the utterances of 100 elderly Japanese people, to improve the speech recognition accuracy for the speech of older people. Humanoid robots are being developed for use in elder care nursing homes. Interaction with such robots is expected to help maintain the cognitive abilities of nursing home residents, as well as providing them with companionship. In order for these robots to interact with elderly people through spoken dialogue, a high-performance speech recognition system for the speech of elderly people is needed. To develop such a system, we recorded speech uttered by 100 elderly Japanese, most of whom live in nursing homes, with an average age of 77.2 years. Previously, a seniors’ speech corpus named S-JNAS was developed, but the average age of its participants was 67.6 years, whereas the target age for nursing home care is around 75 years old, much higher than that of the S-JNAS samples. In this paper we compare our new corpus with an existing Japanese read speech corpus, JNAS, which consists of adult speech, and with the above-mentioned S-JNAS, the senior version of JNAS.

Polish Rhythmic Database – New Resources for Speech Timing and Rhythm Analysis
Agnieszka Wagner, Katarzyna Klessa and Jolanta Bachan

This paper reports on a new database, the Polish rhythmic database, and on tools developed with the aim of investigating timing phenomena and rhythmic structure in Polish, including topics such as, inter alia, the effect of speaking style and tempo on timing patterns, phonotactic and phrasal properties of speech rhythm, and the stability of rhythm metrics. So far, 19 native and 12 non-native speakers with different first languages have been recorded. The collected speech data (5 h 14 min) represent five different speaking styles and five different tempi. For the needs of speech corpus management, annotation and analysis, a database was developed and integrated with Annotation Pro (Klessa et al., 2013; Klessa, 2016). Currently, the database is the only resource for Polish which allows for a systematic study of a broad range of phenomena related to speech timing and rhythm. The paper also introduces new tools and methods developed to facilitate the database annotation and analysis with respect to various timing and rhythm measures. Finally, the results of ongoing research and first experimental results using the new resources are reported, and future work is sketched.
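One commonly used rhythm metric of the kind such a resource supports, the normalized pairwise variability index (nPVI) over vocalic interval durations, can be computed as in the sketch below; the durations are invented for illustration and are not taken from the database.

    def npvi(durations):
        """Normalized pairwise variability index over successive interval durations (in percent)."""
        pairs = zip(durations, durations[1:])
        terms = [abs(a - b) / ((a + b) / 2) for a, b in pairs]
        return 100 * sum(terms) / len(terms)

    vocalic_ms = [85, 120, 60, 140, 95, 70]  # successive vocalic interval durations in ms
    print(f"nPVI = {npvi(vocalic_ms):.1f}")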

An Extension of the Slovak Broadcast News Corpus based on Semi-Automatic Annotation
Peter Viszlay, Ján Staš, Tomáš Koctúr, Martin Lojka and Jozef Juhár

In this paper, we introduce an extension of our previously released TUKE-BNews-SK corpus based on a semi-automatic annotation scheme. It relies first on the automatic transcription of the BN data performed by our Slovak large-vocabulary continuous speech recognition system. The generated hypotheses are then manually corrected and completed by trained human annotators. The corpus is composed of 25 hours of fully annotated spontaneous and prepared speech. In addition, we have acquired 900 hours of additional BN data, part of which we plan to annotate semi-automatically. We present a preliminary corpus evaluation that gives very promising results.

Generating a Yiddish Speech Corpus, Forced Aligner and Basic ASR System for the AHEYM Project
Malgorzata Cavar, Damir Cavar, Dov-Ber Kerler and Anya Quilitzsch

To create automatic transcription and annotation tools for the AHEYM corpus of recorded interviews with Yiddish speakers in Eastern Europe, we develop initial Yiddish language resources that are used for adaptations of speech and language technologies. Our project aims at the development of resources and technologies that can make the entire AHEYM corpus and other Yiddish resources more accessible not only to the community of Yiddish speakers or linguists with language expertise, but also to historians, experts from other disciplines, and the general public. In this paper we describe the rationale behind our approach, the procedures and methods, and challenges that are not specific to the AHEYM corpus but apply to all documentary language data that is collected in the field. To the best of our knowledge, this is the first attempt to create a speech corpus and speech technologies for Yiddish. It is also the first attempt to develop speech and language technologies to transcribe and translate a large collection of Yiddish spoken language resources.

