TENTH INTERNATIONAL CONFERENCE ON
LANGUAGE RESOURCES AND EVALUATION
Held under the Honorary Patronage of His Excellency Mr. Borut Pahor, President of the Republic of Slovenia
MAY 23 – 28, 2016
GRAND HOTEL BERNARDIN CONFERENCE CENTRE
Portorož, SLOVENIA
CONFERENCE ABSTRACTS
Editors: Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis.
Assistant Editors: Sara Goggi, Hélène Mazo
The LREC 2016 Proceedings are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License
LREC 2016, TENTH INTERNATIONAL CONFERENCE ON LANGUAGE
RESOURCES AND EVALUATION
Title: LREC 2016 Conference Abstracts
Distributed by:
ELRA – European Language Resources Association
9, rue des Cordelières, 75013 Paris, France
Tel.: +33 1 43 13 33 33 – Fax: +33 1 43 13 33 30
www.elra.info and www.elda.org
Email: info@elda.org and lrec@elda.org
ISBN 978-2-9517408-9-1 EAN 9782951740891
Introduction of the Conference Chair and ELRA President
Nicoletta Calzolari
Welcome to the 10th edition of LREC in Portorož, back on the Mediterranean Sea!
I wish to express to His Excellency Mr. Borut Pahor, President of the Republic of Slovenia, the gratitude of the Program Committee and of all LREC participants, as well as my personal gratitude, for his Distinguished Patronage of LREC 2016.
Some figures: previous records broken again!
It is only the 10th LREC (18 years after the first), but it has already become one of the most successful and popular conferences in the field. We continue the tradition of breaking previous records: we received 1250 submissions, 23 more than in 2014, together with 43 workshop and 6 tutorial proposals.
At every LREC the Program Committee faces a harder and harder job: going through 3750 reviews to assess – beyond the scores, and especially when they differ greatly – each paper's relevance and novelty, as well as its suitability for an oral or a poster presentation. The program features 744 papers: 203 orals and 541 posters.
We recruited an impressive number of reviewers, 1046 (76 more than in 2014), to keep the number of papers per reviewer rather low. This was a great effort involving a very large part of our community. To reach this number we had to invite 1427 colleagues, of whom 182 declined and 199 regrettably did not answer. In the end, I must say, a few reviewers did not complete their assignments, and we had to recruit replacements in a hurry.
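As a quick sanity check, the recruitment figures quoted above are internally consistent: inviting 1427 colleagues, minus 182 refusals and 199 non-responses, leaves exactly the 1046 reviewers reported. A minimal sketch:

```python
# Reviewer recruitment figures quoted in the text.
invited = 1427
declined = 182
no_answer = 199

recruited = invited - declined - no_answer
print(recruited)  # 1046, matching the number of reviewers reported
```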
We also have 30 Workshops and 6 Tutorials.
More than 1100 participants have already registered at the beginning of May.
These figures and the continuously growing trend have a clear meaning. The field of Language Resources and Evaluation is very alive and constantly flourishing. And LREC seems still to be – as many say – “the conference where you have to be and where you meet everyone”.
LREC acceptance rate: a reasoned choice
Also this time, as I usually do, I want to highlight the LREC acceptance rate: 59.52% this year. This would be unusual for other major conferences, but for us it is a reasoned choice. This level of acceptance is a special feature of LREC and probably one of the reasons why LREC succeeds in giving us an overall picture of the field and in revealing how it is evolving. For us it is important not only to look at the top methods, but also to see how widely various methods or resources spread, for which purposes and usages, and among which languages. Multilingualism – and the equal treatment of all languages – is an essential feature of LREC, as is the effort to bring the text, speech and multimodal communities together.
The acceptance rate goes together with the sense of inclusiveness that is important for us (instead of the sense of elitism associated with a low acceptance rate).
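The rate quoted above follows directly from the submission and acceptance counts given earlier (744 accepted papers out of 1250 submissions):

```python
# Acceptance rate derived from the figures reported in the text.
submissions = 1250
accepted = 744

rate = 100 * accepted / submissions
print(f"{rate:.2f}%")  # 59.52%
```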
And I want to underline again that quality is not necessarily undermined by a high acceptance rate; it is also determined by the influence of the papers on the community, and the ranking of LREC among conferences in the same area proves this. According to the Google Scholar h-index, LREC ranks 4th among Computational Linguistics publications.
I was really proud when a colleague recently told me that LREC, with its broad variety of topics, is the conference where he gets more ideas than at any other!
LREC 2016 Trends
From one LREC to the next I have tried (since 2004) to spot, even if in a cursory and subjective way, the major trends and the rise and fall of certain topics.
After highlighting the major trends of 2016, also in comparison with 2014 and previous years, I offer here a few general considerations. The comparison with previous years highlights the topics showing steady progress, or even great leaps forward, the stable ones, and those that may be more affected by the fashion of the moment.
Trends in LREC 2016 topics, also compared to 2014
Among the areas that continue to be trendy and are increasing I can mention:
▪ Social Media analysis, started in 2012 and increasing in 2014, is doubling again
▪ Discourse, Dialogue and Interactivity
▪ Treebanks, with a big increase with respect to the past
▪ Less-resourced languages
▪ Semantics in general, and in particular Sentiment, Emotion and, more broadly, Subjectivity
▪ Information extraction, Knowledge discovery, Text mining
▪ Multilinguality in general and Machine Translation
▪ Evaluation methodologies
Unsurprisingly, many papers fall in the "usual" topics:
▪ Lexicons, even if a bit decreasing
▪ Corpora
▪ Infrastructural issues, policies, strategies and large projects: topics that receive special attention at LREC, differently from other major conferences – another distinguishing feature of LREC.
Newer trends:
▪ Digital Humanities
▪ Robotics
Stable topics:
▪ Speech-related topics, increasing a little but not as much as we would like
▪ Multimodality
▪ Grammar and syntax
▪ Linked data, a new topic in 2014, remains stable
▪ Computer Aided Language Learning, an increasing topic in 2014, is stable
Less-represented topics with respect to the past:
▪ Web services and workflows
▪ Sign language (probably because there is a very successful workshop on this)
▪ Ontologies
▪ Standards and metadata
▪ Temporal and spatial annotation
▪ Crowdsourcing
Overall trends, from 2004 … and before
From 2004 we observe a big increase in papers related to Multilingualism and Machine Translation. This may also be related to the funding of Machine Translation projects from the European Commission.
The analysis of Social Media and of Subjectivity – sentiment, opinion, emotions – started in this time span and is not only well consolidated but also continually expanding.
There is a declining tendency for papers related to grammar and syntax. This, however, makes the sharp increase of papers on Treebanks this year all the more interesting.
There seems to be a small decrease in papers on Lexicons and Lexical acquisition, as well as on Terminology. These were probably more popular topics years ago, when WordNet and FrameNet lexicons were being built for many languages.
I have recently been reminded by a colleague at ILC of a paper I wrote many years ago with some considerations on Computational Linguistics as reflected by the papers at COLING 1982. There was obviously no mention of Language Resources (the term itself was coined by Zampolli later on), but already then I underlined the element of novelty constituted by the area that was at that time called "Linguistic Data Bases", with some papers on dictionaries "in machine-readable form". And the word "corpus" never appeared in my review – nor, probably, in the papers themselves! Only 30 years ago, Computational Linguistics was a totally different field. The new area of Language Resources was born some years after those initial, pioneering, sparse papers. But this new topic, as testified by the success of LREC, has expanded incredibly fast.
And a new community has taken shape around Language Resources. A peculiarity of this community is the attention paid to infrastructural issues and to overall strategies and policies. This is also due, I believe, to the fact that in many cases we have to work in large groups and for many languages: we must be able to build on each other's work, to connect different resources and tools, to make available what already exists, and to use standardised formats. Infrastructures (along many dimensions) are really needed for this field to progress.
I wrote in the introduction to LREC 2006, 10 years ago: "Do we have revolutions? Probably not. Even if the stable growth of the field brings in itself some sort of revolution. After a proliferation of LRs and tools, we need now to converge. We need more processing power, more integration of modalities, more standards and interoperability, more sharing (in addition to distribution), more cooperative work (and tools enabling this), which means also more infrastructures and more coordination." I think that many of the needs I expressed then are being fulfilled today, as testified by the papers in this edition of LREC, and therefore we can probably speak of a sort of quiet revolution.
LREC Proceedings in Thomson Citation Index
Let me also recall that since 2010 the LREC Proceedings have been accepted for inclusion in the CPCI (Thomson Reuters Conference Proceedings Citation Index). This is quite an important achievement, providing better recognition to all LREC authors and being particularly useful for young colleagues.
ELRA and LREC
ELRA 20th Anniversary: achievements, promotion of new trends and community building
In 2015 we organised a workshop for the 20th anniversary of ELRA, founded in 1995. The fact that ELRA has remained in the Language Technology picture, with growing influence, throughout these 20 years is a big success – even more so given that ELRA does not rely on specific public funding.
ELRA has implemented, over the years, many services for the language resource community, promoting the integration not only of different modalities but also of different communities. And ELRA itself has evolved to reflect the evolution of the field and sometimes to anticipate new trends.
I mention here a few services: Evaluation, Validation, Production of LRs, Networking, LREC, LRE Map, Sharing of LRs, Licensing wizard, ISLRN, LRE Journal, Less-resourced languages committee. These initiatives must not be seen as unrelated steps, but as part of a coherent vision promoting a new culture in our community. Many of ELRA's actions are infrastructural in nature, in the knowledge that research is also affected by such infrastructural activities.
LREC is probably the most influential ELRA achievement, the one with the greatest impact on the overall community. Also through LREC, ELRA has certainly contributed to shaping our field, establishing Language Resources as a scientific field in its own right.
This is also attested by the LRE Journal, co-edited by Nancy Ide and myself, whose submissions are continuously increasing.
Citation of Language Resources
This year we introduced and encouraged the citation of Language Resources, providing recommendations on how to cite them: we have in fact added a special References section. This must become normal practice, both to keep track of the relevance of LRs and to give due recognition to those who work on language resources.
This will be implemented also in the LRE journal.
Replicability of research results
We continue to promote, also through many of the initiatives above, greater visibility of LRs, sharing LRs in an easier way and replicability of research results as a normal part of scientific practice. ELRA is thus strengthening the LR scientific ecosystem and fostering sustainability.
Acknowledgments
As usual, it is my pleasure to express here my deepest gratitude to all those who made this LREC 2016 possible and hopefully successful.
I first thank the Program Committee members, not only for their dedication to the huge task of selecting the papers, workshops and tutorials, but also for their constant involvement in the many aspects surrounding LREC. A particular thanks goes to Jan Odijk, who has been so helpful in the preparation of the programme; to Joseph Mariani for his always wise suggestions; and obviously to Khalid Choukri, who is in charge of so many aspects of LREC.
I thank ELRA and the ELRA Board: LREC is a major service from ELRA to all the community!
A very special thanks goes to Sara Goggi and Hélène Mazo, the two Chairs of the Organizing Committee, for all the work they do with so much dedication and competence, and for their capacity to tackle the many big and small problems of such a large conference (not an easy task). They are the two pillars of LREC; without their commitment over many months, LREC would not happen. So much of the LREC organisation rests on their shoulders, and this is visible to all participants.
I am grateful also to the Local Committee, especially to Simon Krek (its Chair) and Marko Grobelnik, for their help in organising a successful LREC.
My appreciation goes also to the distinguished members of the Local Advisory Committee for their constant support.
I express my great gratitude to the Sponsors that believe in the importance of our conference, and have helped with financial support. I am grateful to the authorities, and the associations and organisations that have supported LREC in various ways.
Furthermore, on behalf of the Program Committee, I praise our impressively large Scientific Committee. They did a wonderful job.
I thank the workshop and tutorial organisers, who complement LREC with so many interesting events.
A big thanks goes to all the LREC authors, who provide the “substance” to LREC, and give us such a broad picture of the field.
I finally thank the two institutions that always dedicate a great effort to LREC, i.e. ELDA in Paris and ILC-CNR in Pisa. Without their commitment LREC would not be possible. The last, but not least, thanks go – in addition to Hélène Mazo and Sara Goggi – to all the others who, in different roles, have helped and will help during the conference: Paola Baroni, Roberto Bartolini, Dominique Brunato, Irene De Felice, Riccardo Del Gratta, Meritxell Fernández Barrera, Francesca Frontini, Lin Liu, Valérie Mapelli, Monica Monachini, Vincenzo Parrinelli, Vladimir Popescu, Valeria Quochi, Caroline Rannaud, Irene Russo, Priscille Schneller, Alexandre Sicard. You will meet most of them during the conference.
I also hope that funding agencies will be impressed by the quality and quantity of initiatives in our sector that LREC displays, and by the fact that the field attracts all the best groups of R&D from all continents. The success of LREC for us actually means the success of the field of Language Resources and Evaluation.
And lastly, my final words of appreciation are for all the LREC 2016 participants. Now LREC is in your hands. You are the true protagonists of LREC; we have worked for you, and you will make this LREC great. I hope that you discover new paths, that you perceive the ferment and liveliness of the field, that you have fruitful conversations (conferences are useful for this too) and, most of all, that you profit from so many contacts to organise exciting new work and projects in the field of Language Resources and Evaluation … which you will show at the next LREC.
LREC is again in a Mediterranean location this time! I am sure you will like Portorož and Slovenia and the Mediterranean atmosphere. And I hope that Portorož and Piran will appreciate the invasion of LRECers!
With all the Programme Committee, I welcome you at LREC 2016 and wish you a very fruitful Conference.
Enjoy LREC 2016 in Portorož!
Nicoletta Calzolari
Chair of the 10th International Conference on Language Resources & Evaluation and ELRA President
Message from ELRA Secretary General and ELDA Managing Director
Khalid Choukri
Welcome to LREC 2016, the 10th edition of one of the major events in language sciences and technologies and the most visible service of the European Language Resources Association (ELRA) to the community. Let me first express, on behalf of the ELRA/ELDA and LREC team, our profound gratitude to His Excellency Mr. Borut Pahor, President of the Republic of Slovenia, for his Distinguished Patronage of LREC 2016 and for honoring us with his presence.
ELRA & LREC
Since 1998 and the first LREC in Granada, it has been a privilege to speak to the language sciences and technologies communities every two years. We always feel gifted to have the chance to share ELRA's views, concerns, expectations, and plans for the future with over a thousand experts gathered in a lovely and relaxed atmosphere. It is also an important occasion to report on our observations of the community's activities, in particular on the recent trends that we continuously monitor. It is very rewarding to do so in a place to be remembered. The ELRA Board has always done its best to make this event memorable: an event with rich scientific content, offering great networking opportunities, occasions to initiate new projects and to form new friendships with colleagues in the same field, and a lot of emotions.

Let me first share with you a deep feeling about our organization of LREC before making some remarks about our recent activities and those of our community. When discussing LREC within the ELRA Board, the first major step relates to identifying the next location. The usual and typical logistical requirements are the first and most critical criteria debated. The motivation and support of a local community is also a decisive factor. Remember that there is no call for bids, but rather spontaneous proposals from local teams. Since its inception, LREC has been organized by the same permanent team, with the support of local committees representing our field. Last but not least, the attractiveness of the location is a crucial element, now part of the conference's DNA. Our motto has always been that working and meeting in very relaxing conditions can improve our efficiency and productivity and enhance our interactions. We have never jeopardized the scientific part when selecting our locations.
ELRA monitoring the new HLT trends
Over the last 9 LRECs and this tenth one, we have had the privilege to witness the emergence of new research fields, but also the deployment of impressive applications that we, as a community (and also as ELRA), have proudly laid the ground for. ELRA was established in 1995 and the first LREC took place in Granada in 1998. In November 2015, we celebrated our 20th anniversary. It was very exciting to discuss how far we have really come in fulfilling some of the objectives that were essential to the development of our field: many of us remember how challenging it was to develop automatic speech recognition for very basic tasks (even to recognize the 10 digits and some command words, or to discriminate between all forms of Yes and No, in a few languages, the usual "big" ones), not to mention dialogue systems and speech understanding. How challenging and critical it was to make small advances in Machine Translation systems, mostly for a few "lucrative" languages, and to develop methodologies to assess their performance.

Extending these tasks with more challenging achievements is today one of the core activities of many large research teams all over the world, and one can imagine the underlying techniques composing all these applications, from processing capacities to basic tools (image and acoustic processing; morphological, syntactic and semantic analysis) and the curation of appropriate Language Resources. All of this is affordable for many players, using open-source packages, cloud-computing facilities, etc. Breakthroughs are impressive and many applications are now deployed and used by the public at large. However, the bottleneck remains, as in the past, the availability of openly licensable Language Resources and their sharing within the community, in particular for the less-resourced languages.

Our community is probably one of the "Big Data" communities that benefited most from the emergence of the web. The web came as a repository of treasures of data that pushed our "data-driven" paradigms. The web also brought new challenges and potential problems related to language processing, understanding, summarization, generation, translation, etc., and boosted new needs for tackling data and text mining, sentiment analysis, opinion detection, etc. Twitter and other social media, for instance, have now become one of the main data sources, while generating scientific problems that the research communities have to address.
These trends are now common ground to many research topics and activities, but they also highlight the serious gaps between languages and the many barriers that are hindering the progress in various fields and geographical areas.
ELRA and the legal issues
In addition to the problems related to discovering and interoperating data sets, a critical obstacle that has limited our investigations has to do with legal and ethical aspects. These are important issues, and ELRA has always worked on raising the awareness of the various stakeholders (research and industrial communities as well as policy makers). I have often stressed this in my speeches at LRECs: the legal and ethical issues are major topics that require more lobbying and petitioning from our community. Research progress, in particular when it is not-for-profit and carried out in academic spheres, should not be impeded by so many obstacles, most of which should have disappeared with the emergence of the internet and the digital world.

The most critical issue is copyright, along with related laws and regulations, which prevents the re-use of data for research. The idea is not to restrict the intellectual property rights of authors and creators but, on the contrary, to ensure legal fair use of copyrighted works by the research community and to prevent misuses that are very common, as illustrated by the statement "the web as a corpus", which seems to imply that online content is by default freely reusable. A few countries have adopted the doctrine of fair use, giving their research communities a very "competitive" advantage. This is the case in the USA (where Federal Government works are exempt from copyright protection).

An important move from the European Commission (EC) is its challenging objective to establish a Single Digital Market (SDM) across the European Union Member States. The language barriers are only one of the obstacles that the European Commission will face in establishing the SDM. To boost innovation based on the data held by public sector bodies, the EC has issued an important regulation, the Public Sector Information (PSI) Directive [1]. It requires all public bodies to release the data they produce to the public, so the data can be used for innovative applications. Linguistic data is part of the deal, and we hope to see more resources with cleared IPR in our catalogues and repositories soon, for use by all.

The USA and the European Union are simple examples of a large international movement that will hopefully be beneficial to our community. However, as legal aspects are always subject to interpretation, the "Territoriality and Extraterritoriality in Intellectual Property Law" [2] does not help to clear these issues. It is crucial that an international harmonization addresses this (maybe through a serious amendment of existing conventions). Our community should contribute to the initiatives going on in various countries concerning copyright and other related rights in the information society [3], to boost data sharing and re-use. ELRA strongly supports the Open Data movement and advocates making public sector information more accessible and re-usable, without any license or through very permissive and open licenses.

Another topic debated at the 20th anniversary of ELRA is the set of ethical issues that have to be considered in our field, either when dealing with data management (for instance, the crowdsourcing approach to data production) or when replicating experiments and citing publications of other colleagues. I am very happy to see that the dedicated workshop on "Legal Issues" (http://www.elra.info/en/dissemination/elra-events/legal-issues-workshop-lrec2016/) is taking place at LREC once more, with a large number of registered participants.
Another workshop on ethical issues is scheduled within this LREC (ETHics In Corpus Collection, Annotation and Application, http://emotion-research.net/sigs/ethics-sig/ethi-ca2).
Replicating experiments and Data Citation
This crucial topic of replicating experiments covers a large spectrum of behaviors and was reviewed by the group of experts gathered to discuss the activities of our field at the twentieth anniversary of ELRA. The topic, "Reproducibility of the Research Results and Resources Citation in Science and Technology of Language", has emerged as a hot topic for discussion within many fields of research. With António Branco and Nicoletta Calzolari, we made a proposal for a discussion workshop to debate this topic at LREC. I am very grateful to António Branco, who agreed to take the lead on this sensitive topic (4REAL Workshop: http://4real.di.fc.ul.pt/).

As in many scientific fields, several dimensions are important and require specific consideration. Of course, maintaining research integrity is essential and requires that replication of published results be possible, and even guaranteed. Such replication can only be done through the sharing of resources and approaches. The same requirements apply when comparing results across approaches, which, in addition, requires that one clearly identify the resources used in the benchmarking. The identification of resources and their citation are correlated, and more attention is required than we have paid so far. The ISLRN (International Standard Language Resource Number, http://www.islrn.org/) has been introduced and is being supported by the major data centers. We hope that monitoring this process will show how impactful it is on our field. Citation mechanisms, applied to Language Resources, affect the so-called "impact factor" and the underlying research indexes, and have to be taken care of by the community.

[1] https://ec.europa.eu/digital-single-market/en/european-legislation-reuse-public-sector-information
[2] Alexander Peukert, "Territoriality and Extraterritoriality in Intellectual Property Law", in Günther Handl, Joachim Zekoll & Peer Zumbansen (eds.), Beyond Territoriality: Transnational Legal Authority in an Age of Globalization, Queen Mary Studies in International Law, Brill Academic Publishing, Leiden/Boston, 2012, 189-228.
[3] Study on the application of Directive 2001/29/EC on copyright and related rights in the information society (the "Infosoc Directive"), Jean-Paul Triaille, Séverine Dusollier, Sari Depreeuw, Jean-Paul Triaille (ed.), ISBN: 978-92-79-29918-6, DOI: 10.2780/90141, © European Union, 2013.
ELRA & Data Management Plan
Another major topic, briefly discussed at the ELRA 20th anniversary, is Data Management Plans and the underlying Language Resource sustainability factors (DMP&S). Within its activities and through a number of projects, ELRA has always advocated a comprehensive data management strategy that ensures efficient management of the production, repurposing or repackaging processes, a clear and adequate validation process, sharing and distribution mechanisms, and a sustainability plan [4] (http://www.flarenet.eu/sites/default/files/D2.2a.pdf). It is now common practice for most funding agencies to request that a Data Management Plan be part of the proposals they receive. We hope that such plans are seriously designed and not treated merely as administrative sections of the proposals. ELRA will continue to work on this challenging task for the emerging resource types and contexts of use, and will continue to offer its support (including through its helpdesk on technical and legal matters) to proposers and project managers. ELRA is working on a dedicated wizard to help Language Resource managers produce their own DMPs, based on its background and on input from other projects (the MIT Libraries' Research Data Management Plan [5] and the initiative held by the Inter-university Consortium for Political and Social Research [6]).

To assess how this "sustainability" dimension is taken into consideration, ELRA has established a set of resources that have been monitored by our Language Resource experts since 2010. We started the list with resources that were not part of the catalogues of data centers. One would be surprised to see that over 28% have simply disappeared from the radar, with many others requiring a thorough search before one can find them again. And it is always complicated to identify them accurately and to verify that they correspond to the ones on our list.
Again, this emphasizes the need for repositories that adhere to a clear code of conduct. The simple "web" (and URLs) does not constitute a reliable and persistent repository.
LREC 2016, some features
As usual, LREC 2016 features a large number of workshops. We are proud to continue to support the Sign Language workshop, which is building bridges between several research communities and setting up partnerships for research on so many modalities (http://www.sign-lang.uni-hamburg.de/lrec2016/workshops.html). We are happy to see that the community is very active in paying attention to the less-resourced and under-resourced languages (http://www.ilc.cnr.it/ccurl2016/index.htm), a topic that is now taken care of by a dedicated ELRA committee, the LRL committee. Many other specific topics are covered in this edition's workshops: Arabic and Indian languages, social media, emotions, affect analysis, multimodal resources, and several workshops dealing with MT and MT-related aspects.

LREC 2016 also features a panel discussion with some of the major funding agencies. We hope to draw some conclusions about their past activities, but also to discuss roadmaps for the next decade.

Last but not least, the importance of our technologies was emphasized when a strong earthquake hit Nepal in April 2015. Many teams offered their technologies to help the rescue groups, in particular multilingual applications, and some teams quickly designed and developed new applications for this. ELRA donated Nepalese resources for this purpose. We are thankful to our partners who agreed to waive all fees on these resources, and very grateful to those who developed the applications that may have helped a little bit.

[4] Khalid Choukri and Victoria Arranz, "An Analytical Model of Language Resource Sustainability", Proceedings of LREC 2012.
[5] https://libraries.mit.edu/data-management/plan/write/
[6] http://www.icpsr.umich.edu/
Acknowledgments
Finally, I would like to express my deep thanks to our partners and supporters, who throughout the years have made LREC so successful. I would like to thank our Bronze sponsors, EML (European Media Laboratory GmbH) and Intel; our supporter, Viseo; and our media sponsor, MultiLingual Computing, Inc.
I would also like to thank the HLT Village participants; we hope that such a gathering offers the projects an opportunity to foster their dissemination and, hopefully, to discuss exploitation plans with the participants.
I would like to thank the Local Advisory Committee. Its composition, drawing on the most distinguished personalities of Slovenia, shows the importance of language and language technologies for the country. We do hope that it is a strong sign of the long-term commitment of Slovenian officials.
I would like to thank the LREC Local Committee, chaired by Dr Simon Krek, and the LREC Local Organizing Committee, chaired by Marko Grobelnik – in particular Špela Sitar and Monika Kropej – for providing support to the organization of this LREC edition in Slovenia.
Finally, I would like to warmly thank the joint team of the two institutions that devoted so much
effort over the months, often behind the scenes, to make this week memorable: ILC-CNR in
Pisa and my own team, ELDA, in Paris. These are the two LREC coordinators and pillars: Sara
Goggi and Hélène Mazo, and the team: Roberto Bartolini, Irene De Felice, Meritxell Fernández-
Barrera, Riccardo Del Gratta, Francesca Frontini, Lin Liu, Valérie Mapelli, Monica Monachini,
Vincenzo Parrinelli, Vladimir Popescu, Caroline Rannaud, Irene Russo, Priscille Schneller and
Alexandre Sicard.
Now LREC 2016 is yours; we hope that each of you will achieve valuable results and accomplishments. We, the ELRA and ILC-CNR staff, are at your disposal to help you get the best out of it. Once again, welcome to Portorož and Slovenia, and welcome to LREC 2016!
Table of Contents
O1 - Machine Translation and Evaluation (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
O2 - Sentiment Analysis and Emotion Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
O3 - Corpora for Language Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
O4 - Spoken Corpus Dialogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
P01 - Anaphora and Coreference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
P02 - Computer Aided Language Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
P03 - Evaluation Methodologies (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
P04 - Information Extraction and Retrieval (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
O5 - LR Infrastructures and Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
O6 - Multimodality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
O7 - Multiword Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
O8 - Named Entity Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
P05 - Machine Translation (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
P06 - Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
P07 - Speech Corpora and Databases (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
P08 - Summarisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
P09 - Word Sense Disambiguation (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
O9 - Linked Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
O10 - Multilingual Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
O11 - Lexicons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
O12 - OCR for Historical Text . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
P10 - Discourse (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
P11 - Morphology (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
P12 - Sentiment Analysis and Opinion Mining (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
P13 - Semantics (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
O13 - Large Projects and Infrastructures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
O14 - Document Classification and Text Categorisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
O15 - Morphology (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
O16 - Phonetics and Prosody . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
P14 - Lexical Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
P15 - Multimodality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
P16 - Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
P17 - Part of Speech Tagging (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
P18 - Treebanks (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
O17 - Language Resource Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
O18 - Tweet Corpora and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
O19 - Dependency Treebanks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
O20 - Word Sense Disambiguation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
P19 - Discourse (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
P20 - Document Classification and Text Categorisation (1) . . . . . . . . . . . . . . . . . . . . . . . . . . 62
P21 - Evaluation Methodologies (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
P22 - Information Extraction and Retrieval (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
P23 - Prosody and Phonology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
P24 - Speech Processing (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
O21 - Social Media . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
O22 - Anaphora and Coreference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
O23 - Machine Learning and Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
O24 - Speech Corpus for Health . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
P25 - Crowdsourcing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
P26 - Emotion Recognition/Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
P27 - Machine Translation (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
P28 - Multiword Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
P29 - Treebanks (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
P30 - Linked Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
P31 - LR Infrastructures and Architectures (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
P32 - Large Projects and Infrastructures (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
P33 - Morphology (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
P34 - Semantic Lexicons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
O25 - Sentiment Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
O26 - Discourse and Dialogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
O27 - Machine Translation and Evaluation (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
O28 - Corpus Querying and Crawling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
P35 - Grammar and Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
P36 - Sentiment Analysis and Opinion Mining (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
P37 - Parallel and Comparable Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
P38 - Social Media . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
P39 - Word Sense Disambiguation (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
O29 - Panel on International Initiatives from Public Agencies . . . . . . . . . . . . . . . . . . . . . . . . 106
O30 - Multimodality, Multimedia and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
O31 - Summarisation and Simplification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
O32 - Morphology (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
P40 - Dialogue (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
P41 - Language Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
P42 - Less-Resourced Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
P43 - Named Entity Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
O33 - Textual Entailment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
O34 - Document Classification, Text categorisation and Topic Detection . . . . . . . . . . . . . . . . . . . 118
O35 - Detecting Information in Medical Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
O36 - Speech Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
O37 - Robots and Conversational Agents Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
O38 - Crowdsourcing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
O39 - Corpora for Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
O40 - Treebanks and Syntactic and Semantic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
P44 - Corpus Creation and Querying (1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
P45 - Evaluation Methodologies (3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
P46 - Information Extraction and Retrieval (3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
P47 - Semantic Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
P48 - Speech Processing (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
O41 - Discourse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
O42 - Twitter Related Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
O43 - Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
O44 - Speech Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
P49 - Corpus Creation and Querying (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
P50 - Document Classification and Text Categorisation (2) . . . . . . . . . . . . . . . . . . . . . . . . . . 142
P51 - Multilingual Corpora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
P52 - Part of Speech Tagging (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
O45 - Lexicons: Wordnet and Framenet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
O46 - Digital Humanities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
O47 - Text Mining and Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
O48 - Corpus Creation and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
P53 - Dialogue (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
P54 - LR Infrastructures and Architectures (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
P55 - Large Projects and Infrastructures (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
P56 - Semantics (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
P57 - Speech Corpora and Databases (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Authors Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
O1 - Machine Translation and Evaluation (1)
Wednesday, May 25, 11:35
Chairperson: Bente Maegaard
Oral Session
Evaluating Machine Translation in a Usage Scenario
Rosa Gaudio, Aljoscha Burchardt and António Branco
In this document we report on a user-scenario-based evaluation
aiming at assessing the performance of machine translation (MT)
systems in a real context of use. We describe a series of
experiments performed to estimate the usefulness
of MT and to test whether improvements in MT technology lead to
better performance in the usage scenario. One goal is to find
the best methodology for evaluating the eventual benefit of a
machine translation system in an application. The evaluation
is based on the QTLeap corpus, a novel multilingual language
resource that was collected through a real-life support service via
chat. It is composed of naturally occurring utterances produced
by users while interacting with a human technician providing
answers. The corpus is available in eight different languages:
Basque, Bulgarian, Czech, Dutch, English, German, Portuguese
and Spanish.
Using BabelNet to Improve OOV Coverage in SMT
Jinhua Du, Andy Way and Andrzej Zydron
Out-of-vocabulary words (OOVs) are a ubiquitous and difficult
problem in statistical machine translation (SMT). This paper
studies different strategies of using BabelNet to alleviate the
negative impact brought about by OOVs. BabelNet is a
multilingual encyclopedic dictionary and a semantic network,
which not only includes lexicographic and encyclopedic terms
but also connects concepts and named entities in a very large network
of semantic relations. By taking advantage of the knowledge
in BabelNet, three different methods – using direct training
data, domain-adaptation techniques and the BabelNet API – are
proposed in this paper to obtain translations for OOVs to improve
system performance. Experimental results on English-Polish
and English-Chinese language pairs show that domain adaptation
can better utilize BabelNet knowledge and performs better than
other methods. The results also demonstrate that BabelNet is a
really useful tool for improving translation performance of SMT
systems.
Enhancing Access to Online Education: Quality Machine Translation of MOOC Content
Valia Kordoni, Antal van den Bosch, Katia Lida Kermanidis, Vilelmini Sosoni, Kostadin Cholakov, Iris Hendrickx, Matthias Huck and Andy Way
The present work is an overview of the TraMOOC (Translation for
Massive Open Online Courses) research and innovation project,
a machine translation approach for online educational content.
More specifically, video lectures, assignments, and MOOC forum
text are automatically translated from English into eleven European
and BRIC languages. Unlike previous approaches to machine
translation, the output quality in TraMOOC relies on a multimodal
evaluation schema that involves crowdsourcing, error type
markup, an error taxonomy for translation model comparison, and
implicit evaluation via text mining, i.e. entity recognition and its
performance comparison between the source and the translated
text, and sentiment analysis on the students’ forum posts. Finally,
the evaluation output will result in more, better-quality
in-domain parallel data that will be fed back to the translation
engine for higher quality output. The translation service will
be incorporated into the Iversity MOOC platform and into the
VideoLectures.net digital library portal.
The Trials and Tribulations of Predicting Post-Editing Productivity
Lena Marg
While an increasing number of (automatic) metrics are available
to assess the linguistic quality of machine translations, their
interpretation remains cryptic to many users, specifically in
the translation community. They are clearly useful for
indicating certain overarching trends, but say little about actual
improvements for translation buyers or post-editors. However,
these metrics are commonly referenced when discussing pricing
and models, both with translation buyers and service providers.
With the aim of focusing on automatic metrics that are easier to
understand for non-research users, we identified Edit Distance (or
Post-Edit Distance) as a good fit. While Edit Distance as such
does not express cognitive effort or time spent editing machine
translation suggestions, we found that it correlates strongly with
the productivity tests we performed, for various language pairs
and domains. This paper aims to analyse Edit Distance and
productivity data at the segment level, based on data gathered over
several years. Drawing on these findings, we then explore
how Edit Distance could help predict productivity on new
content. Some further analysis is proposed, with findings to be
presented at the conference.
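As a concrete illustration of the metric the abstract builds on, here is a minimal sketch of a (Post-)Edit Distance computation. The token-level granularity and max-length normalisation used here are our assumptions for illustration, not necessarily the exact formulation used by the author.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def post_edit_distance(mt_output, post_edit):
    """Token-level edit distance between an MT suggestion and its post-edited
    version, normalised by the length of the longer segment."""
    mt, pe = mt_output.split(), post_edit.split()
    return levenshtein(mt, pe) / max(len(mt), len(pe), 1)
```

A segment left untouched by the post-editor scores 0.0; a fully rewritten one approaches 1.0, which is what makes the measure easy to read for non-research users.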
PE2rr Corpus: Manual Error Annotation of Automatically Pre-annotated MT Post-edits
Maja Popović and Mihael Arcan
We present a freely available corpus containing source language
texts from different domains along with their automatically
generated translations into several distinct morphologically rich
languages, their post-edited versions, and error annotations of the
performed post-edit operations. We believe that the corpus will
be useful for many different applications. The main advantage
of the approach used for the creation of the corpus is the fusion
of post-editing and error classification tasks, which have usually
been seen as two independent tasks, although naturally they are
not. We also show benefits of coupling automatic and manual
error classification which facilitates the complex manual error
annotation task as well as the development of automatic error
classification tools. In addition, the approach facilitates annotation
of language pair related issues.
O2 - Sentiment Analysis and Emotion Recognition
Wednesday, May 25, 11:35
Chairperson: Núria Bel
Oral Session
Sentiment Lexicons for Arabic Social Media
Saif Mohammad, Mohammad Salameh and Svetlana Kiritchenko
Existing Arabic sentiment lexicons have low coverage, with only
a few thousand entries. In this paper, we present several large
sentiment lexicons that were automatically generated using two
different methods: (1) by using distant supervision techniques
on Arabic tweets, and (2) by translating English sentiment
lexicons into Arabic using a freely available statistical machine
translation system. We compare the usefulness of new and old
sentiment lexicons in the downstream application of sentence-
level sentiment analysis. Our baseline sentiment analysis system
uses numerous surface form features. Nonetheless, the system
benefits from using additional features drawn from sentiment
lexicons. The best result is obtained using the automatically
generated Dialectal Hashtag Lexicon and the Arabic translations
of the NRC Emotion Lexicon (accuracy of 66.6%). Finally,
we describe a qualitative study of the automatic translations of
English sentiment lexicons into Arabic, which shows that about
88% of the automatically translated entries are valid in Arabic
as well. Close to 10% of the invalid entries are caused by gross
mistranslations, close to 40% by translations into a related word,
and about 50% by differences in how the word is used in Arabic.
A Language Independent Method for Generating Large Scale Polarity Lexicons
Giuseppe Castellucci, Danilo Croce and Roberto Basili
Sentiment Analysis systems aim at detecting opinions and
sentiments that are expressed in texts. Many approaches in
literature are based on resources that model the prior polarity of
words or multi-word expressions, i.e. a polarity lexicon. Such
resources are defined by teams of annotators, i.e. a manual
annotation is provided to associate emotional or sentiment facets
to the lexicon entries. The development of such lexicons is an
expensive and language-dependent process, and the resulting
resources often fail to cover all linguistic sentiment phenomena. Moreover,
once a lexicon is defined, it can hardly be applied to a different
language or even a different domain. In this paper, we present
several Distributional Polarity Lexicons (DPLs), i.e. large-scale
polarity lexicons acquired with an unsupervised methodology
based on Distributional Models of Lexical Semantics. Given a set
of heuristically annotated sentences from Twitter, we transfer the
sentiment information from sentences to words. The approach is
mostly unsupervised, and experimental evaluations on Sentiment
Analysis tasks in two languages show the benefits of the generated
resources. The generated DPLs are publicly available in English
and Italian.
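The core transfer step, moving sentiment information from heuristically labelled sentences down to individual words, can be sketched as follows. The simple label-averaging scheme shown here is an illustrative assumption; the paper's DPLs are built on distributional models of lexical semantics rather than on direct averaging.

```python
from collections import defaultdict

def word_polarity_from_sentences(labeled_sentences):
    """Transfer sentence-level sentiment labels (+1 positive, -1 negative)
    to individual words by averaging over the sentences each word occurs in."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for text, label in labeled_sentences:
        for word in set(text.lower().split()):  # count each word once per sentence
            sums[word] += label
            counts[word] += 1
    return {word: sums[word] / counts[word] for word in sums}
```

Words that co-occur only with positive (or only negative) sentences inherit a clear polarity, while words appearing in both kinds of context average out towards neutrality.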
Sentiment Analysis in Social Networks through Topic Modeling
Debashis Naskar, Sidahmed Mokaddem, Miguel Rebollo and Eva Onaindia
In this paper, we analyze the sentiments derived from the
conversations that occur in social networks. Our goal is to identify
the sentiments of the users in the social network through their
conversations. We conduct a study to determine whether users
of social networks (Twitter in particular) tend to gather together
according to the likeness of their sentiments. In our proposed
framework, (1) we use ANEW, a lexical dictionary, to identify
affective emotional feelings associated with a message according to
Russell’s model of affect; (2) we design a topic modeling
mechanism called Sent_ LDA, based on the Latent Dirichlet
Allocation (LDA) generative model, which allows us to find
the topic distribution in a general conversation and we associate
topics with emotions; (3) we detect communities in the network
according to the density and frequency of the messages among the
users; and (4) we compare the sentiments of the communities by
using Russell’s model of affect versus polarity, and we measure
the extent to which topic distribution strengthens likeness in the
sentiments of the users of a community. This work contributes
a topic modeling methodology for analyzing the sentiments in
conversations that take place in social networks.
A Comparison of Domain-based Word Polarity Estimation using different Word Embeddings
Aitor García Pablos, Montse Cuadros and German Rigau
A key point in Sentiment Analysis is to determine the polarity
of the sentiment implied by a certain word or expression. In
basic Sentiment Analysis systems, the sentiment polarity of the
words is aggregated and weighted in different ways to provide a
degree of positivity/negativity. Currently, words are also modelled
as continuous dense vectors, known as word embeddings, which
seem to encode interesting semantic knowledge. With regard
to Sentiment Analysis, word embeddings are used as features
in more complex supervised classification systems to obtain
sentiment classifiers. In this paper we compare a set of existing
sentiment lexicons and sentiment lexicon generation techniques.
We also show a simple but effective technique to calculate a word
polarity value for each word in a domain using existing continuous
word embeddings generation methods. Further, we also show
that word embeddings calculated on an in-domain corpus capture
the polarity better than those calculated on a general-domain
corpus.
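One simple way to derive a per-word polarity value from pre-trained embeddings, in the spirit of the technique the abstract describes, is to compare each word vector against centroids of positive and negative seed words. The seed-centroid formulation below is our assumption for illustration, not necessarily the authors' exact method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def polarity(word_vec, positive_seeds, negative_seeds):
    """Polarity score: similarity to the positive seed centroid minus
    similarity to the negative seed centroid."""
    return (cosine(word_vec, centroid(positive_seeds))
            - cosine(word_vec, centroid(negative_seeds)))
```

Training the embeddings on an in-domain corpus, as the paper reports, would shift these similarities towards domain-specific word usage.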
Could Speaker, Gender or Age Awareness be beneficial in Speech-based Emotion Recognition?
Maxim Sidorov, Alexander Schmitt, Eugene Semenkin and Wolfgang Minker
Emotion Recognition (ER) is an important part of dialogue
analysis which can be used in order to improve the quality of
Spoken Dialogue Systems (SDSs). The emotional hypothesis
of the current response of an end-user might be utilised by
the dialogue manager component in order to change the SDS
strategy which could result in a quality enhancement. In this
study additional speaker-related information is used to improve
the performance of the speech-based ER process. The analysed
information is the speaker identity, gender and age of a user. Two
schemes are described here, namely, using additional information
as an independent variable within the feature vector and creating
separate emotional models for each speaker, gender or age-cluster
independently. The performances of the proposed approaches
were compared against the baseline ER system, where no
additional information was used, on a number of emotional
speech corpora in German, English, Japanese and Russian. The
study revealed that for some of the corpora the proposed approach
significantly outperforms the baseline methods with a relative
difference of up to 11.9%.
O3 - Corpora for Language Analysis
Wednesday, May 25, 11:35
Chairperson: Stelios Piperidis
Oral Session
Discriminative Analysis of Linguistic Features for Typological Study
Hiroya Takamura, Ryo Nagata and Yoshifumi Kawasaki
We address the task of automatically estimating the missing values
of linguistic features by making use of the fact that some linguistic
features in typological databases are mutually informative.
The questions addressed in this work are: (i) how much predictive
power does one feature have on the value of another? (ii) to
what extent can we attribute this predictive power to genealogical
or areal factors, as opposed to being provided by tendencies or
implicational universals? To address these questions, we conduct
a discriminative or predictive analysis on the typological database.
Specifically, we use a machine-learning classifier to estimate the
value of each feature of each language using the values of the other
features, under different choices of training data: all the other
languages, or all the other languages except those sharing
the same origin or area as the target language.
POS-tagging of Historical Dutch
Dieuwke Hupkes and Rens Bod
We present a study of the adequacy of current methods that
are used for POS-tagging historical Dutch texts, as well as an
exploration of the influence of employing different techniques to
improve upon the current practice. The main focus of this paper is
on (unsupervised) methods that are easily adaptable for different
domains without requiring extensive manual input. It was found
that modernising the spelling of corpora prior to tagging them with
a tagger trained on contemporary Dutch results in a large increase
in accuracy, but that spelling normalisation alone is not sufficient
to obtain state-of-the-art results. The best results were achieved
by training a POS-tagger on a corpus automatically annotated by
projecting (automatically assigned) POS-tags via word alignments
from a contemporary corpus. This result is promising, as it
was reached without including any domain knowledge or context
dependencies. We argue that the insights of this study combined
with semi-supervised learning techniques for domain adaptation
can be used to develop a general-purpose diachronic tagger for
Dutch.
A Language Resource of German Errors Written by Children with Dyslexia
Maria Rauschenberger, Luz Rello, Silke Füchsel and Jörg Thomaschewski
In this paper we present a language resource for German,
composed of a list of 1,021 unique errors extracted from a
collection of texts written by people with dyslexia. The errors
were annotated with a set of linguistic characteristics as well as
visual and phonetic features. We present the compilation and the
annotation criteria for the different types of dyslexic errors. This
language resource has many potential uses since errors written by
people with dyslexia reflect their difficulties. For instance, it has
already been used to design language exercises to treat dyslexia in
German. To the best of our knowledge, this is the first resource of this
kind in German.
CItA: an L1 Italian Learners Corpus to Study the Development of Writing Competence
Alessia Barbagli, Pietro Lucisano, Felice Dell’Orletta, Simonetta Montemagni and Giulia Venturi
In this paper, we present the CItA corpus (Corpus Italiano di
Apprendenti L1), a collection of essays written by Italian L1
learners collected during the first and second year of lower
secondary school. The corpus was built in the framework of
an interdisciplinary study jointly carried out by computational
linguists and experimental pedagogists, aimed at tracking the
development of written language competence over the years
together with students’ background information.
If You Even Don’t Have a Bit of Bible: Learning Delexicalized POS Taggers
Zhiwei Yu, David Mareček, Zdeněk Žabokrtský and Daniel Zeman
Part-of-speech (POS) induction is one of the most popular
tasks in research on unsupervised NLP. Various unsupervised
and semi-supervised methods have been proposed to tag an
unseen language. However, many of them require some partial
understanding of the target language because they rely on
dictionaries or parallel corpora such as the Bible. In this paper,
we propose a different method named delexicalized tagging, for
which we only need a raw corpus of the target language. We
transfer tagging models trained on annotated corpora of one or
more resource-rich languages. We employ language-independent
features such as word length, frequency, neighborhood entropy,
character classes (alphabetic vs. numeric vs. punctuation) etc.
We demonstrate that such features can, to a certain extent, serve as
predictors of the part of speech, represented by the universal POS
tag.
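The kind of language-independent features the abstract lists can be sketched as follows. This is a minimal illustration under our own assumptions (neighbourhood entropy is omitted for brevity, and the exact feature set and representation in the paper may differ):

```python
import math
from collections import Counter

def char_class(token):
    """Coarse character class: alphabetic vs. numeric vs. punctuation/other."""
    if token.isalpha():
        return "alphabetic"
    if token.isdigit():
        return "numeric"
    return "punctuation_or_other"

def delexicalized_features(tokens):
    """Language-independent features per token, of the kind usable for
    delexicalized tagging: word length, log relative frequency, character class."""
    freq = Counter(tokens)
    total = len(tokens)
    return [
        {
            "length": len(tok),
            "log_freq": math.log(freq[tok] / total),
            "char_class": char_class(tok),
        }
        for tok in tokens
    ]
```

Because none of these features mention any particular word forms, a classifier trained on them in one language can in principle be applied directly to a raw corpus of another.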
O4 - Spoken Corpus Dialogue
Wednesday, May 25, 11:35
Chairperson: Asuncion Moreno
Oral Session
The SpeDial datasets: datasets for Spoken Dialogue Systems analytics
José Lopes, Arodami Chorianopoulou, Elisavet Palogiannidi, Helena Moniz, Alberto Abad, Katerina Louka, Elias Iosif and Alexandros Potamianos
The SpeDial consortium is sharing two datasets that were used
during the SpeDial project. By sharing them with the community,
we are providing a resource to shorten the development cycle
of new Spoken Dialogue Systems (SDSs). The
datasets include audios and several manual annotations, i.e.,
miscommunication, anger, satisfaction, repetition, gender and
task success. The datasets were created with data from real
users and cover two different languages: English and Greek.
Detectors for miscommunication, anger and gender were trained
for both systems. The detectors were particularly accurate
in tasks where humans have high annotator agreement such
as miscommunication and gender. As expected due to the
subjectivity of the task, the anger detector had a less satisfactory
performance. Nevertheless, we showed that the automatic
detection of situations that can lead to problems in SDSs is
possible, and that it is a promising direction for shortening
the SDS development cycle.
Creating Annotated Dialogue Resources: Cross-domain Dialogue Act Classification
Dilafruz Amanova, Volha Petukhova and Dietrich Klakow
This paper describes a method to automatically create dialogue
resources annotated with dialogue act information by reusing
existing dialogue corpora. Numerous dialogue corpora are
available for research purposes and many of them are annotated
with dialogue act information that captures the intentions encoded
in user utterances. Annotated dialogue resources, however, differ
in various respects: data collection settings and modalities used,
dialogue task domains and scenarios (if any) underlying the
collection, number and roles of dialogue participants involved
and dialogue act annotation schemes applied. The presented
study encompasses three phases of data-driven investigation.
We, first, assess the importance of various types of features
and their combinations for effective cross-domain dialogue act
classification. Second, we establish the best predictive model
comparing various cross-corpora training settings. Finally, we specify model adaptation procedures and explore late fusion approaches to optimize the overall classification decision-making process. The proposed methodology accounts for empirically motivated and technically sound classification procedures that may reduce annotation and training costs significantly.
Towards a Multi-dimensional Taxonomy of Stories in Dialogue
Kathryn J. Collins and David Traum
In this paper, we present a taxonomy of stories told in dialogue.
We based our scheme on prior work analyzing narrative structure
and method of telling, relation to storyteller identity, as well
as some categories particular to dialogue, such as how the
story gets introduced. Our taxonomy currently has 5 major
dimensions, with most having sub-dimensions; each dimension
has an associated set of dimension-specific labels. We adapted
an annotation tool for this taxonomy and have annotated portions
of two different dialogue corpora, Switchboard and the Distress
Analysis Interview Corpus. We present examples of some of the
tags and concepts with stories from Switchboard, and some initial
statistics of frequencies of the tags.
PentoRef: A Corpus of Spoken References in Task-oriented Dialogues
Sina Zarrieß, Julian Hough, Casey Kennington, Ramesh Manuvinakurike, David DeVault, Raquel Fernandez and David Schlangen
PentoRef is a corpus of task-oriented dialogues collected in
systematically manipulated settings. The corpus is multilingual,
with English and German sections, and overall comprises more
than 20000 utterances. The dialogues are fully transcribed
and annotated with referring expressions mapped to objects
in corresponding visual scenes, which makes the corpus a
rich resource for research on spoken referring expressions in
generation and resolution. The corpus includes several sub-
corpora that correspond to different dialogue situations where
parameters related to interactivity, visual access, and verbal
channel have been manipulated in systematic ways. The
corpus thus lends itself to very targeted studies of reference in
spontaneous dialogue.
Transfer of Corpus-Specific Dialogue Act Annotation to ISO Standard: Is it worth it?
Shammur Absar Chowdhury, Evgeny Stepanov and Giuseppe Riccardi
Spoken conversation corpora often adapt existing Dialogue Act
(DA) annotation specifications, such as DAMSL, DIT++, etc.,
to task specific needs, yielding incompatible annotations; thus,
limiting corpora re-usability. The recently accepted ISO standard for DA annotation, the Dialogue Act Markup Language (DiAML), is designed to be domain- and application-independent. Moreover,
the clear separation of dialogue dimensions and communicative
functions, coupled with the hierarchical organization of the
latter, allows for classification at different levels of granularity.
However, re-annotating existing corpora with the new scheme
might require significant effort. In this paper we test the utility of the ISO standard through a comparative evaluation of the corpus-specific legacy and the semi-automatically transferred DiAML DA annotations on a supervised dialogue act classification task. To
test the domain independence of the resulting annotations, we
perform cross-domain and data aggregation evaluation. Compared
to the legacy annotation scheme, on the Italian LUNA Human-
Human corpus, the DiAML annotation scheme exhibits better
cross-domain and data aggregation classification performance,
while maintaining comparable in-domain performance.
P01 - Anaphora and Coreference
Wednesday, May 25, 11:35
Chairperson: Steve Cassidy
Poster Session
WikiCoref: An English Coreference-annotated Corpus of Wikipedia Articles
Abbas Ghaddar and Phillippe Langlais
This paper presents WikiCoref, an English corpus annotated for
anaphoric relations, where all documents are from the English
version of Wikipedia. Our annotation scheme follows the one of
OntoNotes with a few disparities. We annotated each markable
with coreference type, mention type and the equivalent Freebase
topic. Since most similar annotation efforts concentrate on
very specific types of written text, mainly newswire, there is a
lack of resources for otherwise over-used Wikipedia texts. The
corpus described in this paper addresses this issue. We present
a freely available resource we initially devised for improving
coreference resolution algorithms dedicated to Wikipedia texts.
Our corpus has no restriction on the topics of the documents being
annotated, and documents of various sizes have been considered
for annotation.
Exploitation of Co-reference in Distributional Semantics
Dominik Schlechtweg
The aim of distributional semantics is to model the similarity
of the meaning of words via the words they occur with.
Thereby, it relies on the distributional hypothesis implying
that similar words have similar contexts. Deducing meaning
from the distribution of words is interesting as it can be done
automatically on large amounts of freely available raw text. It is because of this convenience that most current state-of-the-art models of distributional semantics operate on raw text,
although there have been successful attempts to integrate other
kinds of—e.g., syntactic—information to improve distributional
semantic models. In contrast, less attention has been paid to
semantic information in the research community. One reason
for this is that the extraction of semantic information from raw
text is a complex, elaborate matter and in great parts not yet
satisfyingly solved. Recently, however, there have been successful
attempts to integrate a certain kind of semantic information,
i.e., co-reference. Two basically different kinds of information
contributed by co-reference with respect to the distribution of
words will be identified. We will then focus on one of these and
examine its general potential to improve distributional semantic
models as well as certain more specific hypotheses.
Adapting an Entity Centric Model for Portuguese Coreference Resolution
Evandro Fonseca, Renata Vieira and Aline Vanin
This paper presents the adaptation of an Entity Centric Model
for Portuguese coreference resolution, considering 10 named
entity categories. The model was evaluated on named entities using the HAREM Portuguese corpus; the results are 81.0% precision and 58.3% recall overall. The resulting system is freely available.
IMS HotCoref DE: A Data-driven Co-reference Resolver for German
Ina Roesiger and Jonas Kuhn
This paper presents a data-driven co-reference resolution system
for German that has been adapted from IMS HotCoref, a co-
reference resolver for English. It describes the difficulties when
resolving co-reference in German text, the adaptation process and
the features designed to address linguistic challenges brought forth
by German. We report performance on the reference dataset
TüBa-D/Z and include a post-task SemEval 2010 evaluation,
showing that the resolver achieves state-of-the-art performance.
We also include ablation experiments that indicate that integrating
linguistic features improves results. The paper also describes the
steps and the format necessary to use the resolver on new texts.
The tool is freely available for download.
Coreference Annotation Scheme and Relation Types for Hindi
Vandan Mujadia, Palash Gupta and Dipti Misra Sharma
This paper describes a coreference annotation scheme,
coreference annotation specific issues and their solutions through
our proposed annotation scheme for Hindi. We introduce different
co-reference relation types between continuous mentions of the
same coreference chain such as “Part-of”, “Function-value pair”
etc. We used a Jaccard similarity based Krippendorff’s alpha to demonstrate consistency of the annotation scheme, annotation and
corpora. To ease the coreference annotation process, we built
a semi-automatic Coreference Annotation Tool (CAT). We also
provide statistics of coreference annotation on Hindi Dependency
Treebank (HDTB).
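The agreement measure used here builds on Jaccard similarity between annotated sets. A minimal sketch of that component follows; the annotator mention sets are hypothetical, and Krippendorff's alpha itself, into which 1 − Jaccard would plug as the difference function, is not reproduced:

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty annotations agree fully
    return len(a & b) / len(a | b)

# Hypothetical mention sets produced by two annotators for one chain
ann1 = {"Ram", "vah", "uskaa bhaaii"}
ann2 = {"Ram", "vah"}
disagreement = 1 - jaccard(ann1, ann2)  # difference term for alpha
```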
Coreference in Prague Czech-English Dependency Treebank
Anna Nedoluzhko, Michal Novák, Silvie Cinkova, Marie Mikulová and Jirí Mírovský
We present coreference annotation on parallel Czech-English texts
of the Prague Czech-English Dependency Treebank (PCEDT).
The paper describes innovations made to PCEDT 2.0 concerning
coreference, as well as coreference information already present
there. We characterize the coreference annotation scheme, give
the statistics and compare our annotation with the coreference
annotation in OntoNotes and the Prague Dependency Treebank for
Czech. We also present the experiments made using this corpus to
improve the alignment of coreferential expressions, which helps
us to collect better statistics of correspondences between types of
coreferential relations in Czech and English. The corpus released
as PCEDT 2.0 Coref is publicly available.
Sieve-based Coreference Resolution in the Biomedical Domain
Dane Bell, Gus Hahn-Powell, Marco A. Valenzuela-Escárcega and Mihai Surdeanu
We describe challenges and advantages unique to coreference
resolution in the biomedical domain, and a sieve-based
architecture that leverages domain knowledge for both entity
and event coreference resolution. Domain-general coreference
resolution algorithms perform poorly on biomedical documents,
because the cues they rely on such as gender are largely absent
in this domain, and because they do not encode domain-specific
knowledge such as the number and type of participants required
in chemical reactions. Moreover, it is difficult to directly encode
this knowledge into most coreference resolution algorithms
because they are not rule-based. Our rule-based architecture
uses sequentially applied hand-designed “sieves”, with the output
of each sieve informing and constraining subsequent sieves.
This architecture provides a 3.2% increase in throughput to our Reach event extraction system, with precision parallel to that of the stricter system that relies solely on syntactic patterns for extraction.
Annotating Characters in Literary Corpora: A Scheme, the CHARLES Tool, and an Annotated Novel
Hardik Vala, Stefan Dimitrov, David Jurgens, Andrew Piper and Derek Ruths
Characters form the focus of various studies of literary works,
including social network analysis, archetype induction, and plot
comparison. The recent rise in the computational modelling of
literary works has produced a proportional rise in the demand
for character-annotated literary corpora. However, automatically
identifying characters is an open problem and there is low
availability of literary texts with manually labelled characters. To
address the latter problem, this work presents three contributions:
(1) a comprehensive scheme for manually resolving mentions to characters in texts; (2) a novel collaborative annotation tool, CHARLES (CHAracter Resolution Label-Entry System), for character annotation and similar cross-document tagging tasks; and (3) the character annotations resulting from a pilot study on the novel Pride and Prejudice, demonstrating that the scheme and tool facilitate the efficient production of high-quality annotations. We
expect this work to motivate the further production of annotated
literary corpora to help meet the demand of the community.
P02 - Computer Aided Language Learning
Wednesday, May 25, 11:35
Chairperson: Stephanie Strassel
Poster Session
Error Typology and Remediation Strategies for Requirements Written in English by Non-Native Speakers
Marie Garnier and Patrick Saint-Dizier
In most international industries, English is the main language of
communication for technical documents. These documents are
designed to be as unambiguous as possible for their users. For
international industries based in non-English speaking countries,
the professionals in charge of writing requirements are often non-
native speakers of English, who rarely receive adequate training
in the use of English for this task. As a result, requirements
can contain a relatively large diversity of lexical and grammatical
errors, which are not eliminated by the use of guidelines from
controlled languages. This article investigates the distribution
of errors in a corpus of requirements written in English by
native speakers of French. Errors are defined on the basis of
grammaticality and acceptability principles, and classified using
comparable categories. Results show a high proportion of errors
in the Noun Phrase, notably through modifier stacking, and
errors consistent with simplification strategies. Comparisons
with similar corpora in other genres reveal the specificity of
the distribution of errors in requirements. This research also
introduces possible applied uses, in the form of strategies for the
automatic detection of errors, and in-person training provided by
certification boards in requirements authoring.
Improving POS Tagging of German Learner Language in a Reading Comprehension Scenario
Lena Keiper, Andrea Horbach and Stefan Thater
We present a novel method to automatically improve the accuracy
of part-of-speech taggers on learner language. The key idea
underlying our approach is to exploit the structure of a typical
language learner task and automatically induce POS information
for out-of-vocabulary (OOV) words. To evaluate the effectiveness
of our approach, we add manual POS and normalization
information to an existing language learner corpus. Our evaluation
shows an increase in accuracy from 72.4% to 81.5% on OOV
words.
SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies
Elena Volodina, Ildikó Pilán, Ingegerd Enström, Lorena Llozhi, Peter Lundkvist, Gunlög Sundberg and Monica Sandell
We present a new resource for Swedish, SweLL, a corpus of
Swedish Learner essays linked to learners’ performance according
to the Common European Framework of Reference (CEFR).
SweLL consists of three subcorpora – SpIn, SW1203 and Tisus,
collected from three different educational establishments. The
common metadata for all subcorpora includes age, gender, native
languages, time of residence in Sweden, type of written task.
Depending on the subcorpus, learner texts may contain additional
information, such as text genres, topics, grades. Five of
the six CEFR levels are represented in the corpus: A1, A2,
B1, B2 and C1, comprising in total 339 essays. The C2 level is not included since courses at that level are not offered. The
work flow consists of collection of essays and permits, essay
digitization and registration, meta-data annotation, automatic
linguistic annotation. Inter-rater agreement is presented on the
basis of the SW1203 subcorpus. The work on SweLL is still ongoing, with more than 100 essays waiting in the pipeline. This article both describes the resource and the “how-to” behind the compilation of SweLL.
SVALex: a CEFR-graded Lexical Resource for Swedish Foreign and Second Language Learners
Thomas Francois, Elena Volodina, Ildikó Pilán and Anaïs Tack
The paper introduces SVALex, a lexical resource primarily
aimed at learners and teachers of Swedish as a foreign and
second language that describes the distribution of 15,681 words
and expressions across the Common European Framework of
Reference (CEFR). The resource is based on a corpus of
coursebook texts, and thus describes receptive vocabulary learners
are exposed to during reading activities, as opposed to productive
vocabulary they use when speaking or writing. The paper
describes the methodology applied to create the list and to estimate
the frequency distribution. It also discusses some characteristics
of the resulting resource and compares it to other lexical resources
for Swedish. An interesting feature of this resource is the
possibility to separate the wheat from the chaff, identifying the
core vocabulary at each level, i.e. vocabulary shared by several
coursebook writers at each level, from peripheral vocabulary, which is used by only a minority of the coursebook writers.
Detecting Word Usage Errors in Chinese Sentences for Learning Chinese as a Foreign Language
Yow-Ting Shiue and Hsin-Hsi Chen
Automated grammatical error detection, which helps users
improve their writing, is an important application in NLP.
Recently more and more people are learning Chinese, and an
automated error detection system can be helpful for the learners.
This paper proposes n-gram features, dependency count features,
dependency bigram features, and single-character features to
determine if a Chinese sentence contains word usage errors, in
which a word is written as a wrong form or the word selection
is inappropriate. By marking potential errors at the level of sentence segments, typically delimited by punctuation marks, the learner can try to correct the problems without the assistance of a language teacher. Experiments on the HSK corpus show that the
classifier combining all sets of features achieves an accuracy of
0.8423. By utilizing certain combinations of the sets of features,
we can construct a system that favors precision or recall. The
best precision we achieve is 0.9536, indicating that our system is
reliable and seldom produces misleading results.
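Of the feature sets listed, the character n-gram features are the simplest to illustrate. A minimal sketch follows; the example segment is invented, and the paper's dependency-based and single-character feature sets are not shown:

```python
def char_ngrams(segment, n=2):
    """All contiguous character n-grams of a sentence segment."""
    return [segment[i:i + n] for i in range(len(segment) - n + 1)]

# Hypothetical sentence segment, as would be delimited by punctuation
bigrams = char_ngrams("我喜欢学中文")
```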
LibN3L: A Lightweight Package for Neural NLP
Meishan Zhang, Jie Yang, Zhiyang Teng and Yue Zhang
We present a light-weight machine learning tool for NLP research.
The package supports operations on both discrete and dense
vectors, facilitating implementation of linear models as well as
neural models. It provides several basic layers which mainly aim at single-layer linear and non-linear transformations. Using these layers, we can conveniently implement linear models and simple neural models. In addition, the package integrates
several complex layers by composing those basic layers, such as
RNN, Attention Pooling, LSTM and gated RNN. Those complex
layers can be used to implement deep neural models directly.
Evaluating Lexical Simplification and Vocabulary Knowledge for Learners of French: Possibilities of Using the FLELex Resource
Anaïs Tack, Thomas Francois, Anne-Laure Ligozat and Cédrick Fairon
This study examines two possibilities of using the FLELex graded
lexicon for the automated assessment of text complexity in the context of French as a foreign language learning. From the lexical frequency
distributions described in FLELex, we derive a single level of
difficulty for each word in a parallel corpus of original and
simplified texts. We then use this data to automatically address
the lexical complexity of texts in two ways. On the one hand, we
evaluate the degree of lexical simplification in manually simplified
texts with respect to their original version. Our results show
a significant simplification effect, both in the case of French
narratives simplified for non-native readers and in the case of
simplified Wikipedia texts. On the other hand, we define a
predictive model which identifies the number of words in a text
that are expected to be known at a particular learning level.
We assess the accuracy with which these predictions are able to
capture actual word knowledge as reported by Dutch-speaking
learners of French. Our study shows that although the predictions
seem relatively accurate in general (87.4% to 92.3%), they do not
yet seem to cover the learners’ lack of knowledge very well.
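The abstract does not spell out how a single difficulty level is derived from the FLELex frequency distribution. One plausible reading, sketched here purely as an assumption, assigns each word the first CEFR level at which it occurs in the coursebook corpus; the per-level frequencies below are invented:

```python
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def first_occurrence_level(freq_by_level):
    """First CEFR level with a non-zero frequency, or None if unseen."""
    for level in CEFR_LEVELS:
        if freq_by_level.get(level, 0) > 0:
            return level
    return None

# Hypothetical normalised frequencies for one word
level = first_occurrence_level({"B1": 12.5, "B2": 30.1})
```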
A Shared Task for Spoken CALL?
Claudia Baur, Johanna Gerlach, Manny Rayner, Martin Russell and Helmer Strik
We argue that the field of spoken CALL needs a shared task
in order to facilitate comparisons between different groups and
methodologies, and describe a concrete example of such a task,
based on data collected from a speech-enabled online tool which
has been used to help young Swiss German teens practise skills in
English conversation. Items are prompt-response pairs, where the
prompt is a piece of German text and the response is a recorded
English audio file. The task is to label pairs as “accept” or “reject”,
accepting responses which are grammatically and linguistically
correct to match a set of hidden gold standard answers as closely
as possible. Initial resources are provided so that a scratch system
can be constructed with a minimal investment of effort, and in
particular without necessarily using a speech recogniser. Training
data for the task will be released in June 2016, and test data in
January 2017.
Joining-in-type Humanoid Robot Assisted Language Learning System
AlBara Khalifa, Tsuneo Kato and Seiichi Yamamoto
Dialogue robots are attractive to people, and in language
learning systems, they motivate learners and let them practice
conversational skills in a more realistic environment. However,
automatic speech recognition (ASR) of second language (L2) learners is still a challenge, because their speech contains not just pronunciation, lexical and grammatical errors, but is sometimes totally
disordered. Hence, we propose a novel robot assisted language
learning (RALL) system using two robots, one as a teacher and
the other as an advanced learner. The system is designed to
simulate multiparty conversation, expecting implicit learning and
enhancement of predictability of learners’ utterance through an
alignment similar to “interactive alignment”, which is observed
in human-human conversation. We collected a database with the
prototypes, and measured through an initial analysis to what extent the alignment phenomenon is observed in the database.
P03 - Evaluation Methodologies (1)
Wednesday, May 25, 11:35
Chairperson: Ann Bies
Poster Session
OSMAN – A Novel Arabic Readability Metric
Mahmoud El-Haj and Paul Rayson
We present OSMAN (Open Source Metric for Measuring Arabic
Narratives) - a novel open source Arabic readability metric and
tool. It allows researchers to calculate readability for Arabic text
with and without diacritics. OSMAN is a modified version of
the conventional readability formulas such as Flesch and Fog.
In our work we introduce a novel approach towards counting
short, long and stress syllables in Arabic which is essential for
judging readability of Arabic narratives. We also introduce an
additional factor called “Faseeh” which considers aspects of script
usually dropped in informal Arabic writing. To evaluate our
methods we used Spearman’s correlation metric to compare text
readability for 73,000 parallel sentences from English and Arabic
UN documents. The Arabic sentences were written without diacritics, and in order to count the number of syllables we added the diacritics using an open source tool called Mishkal. The results show that the OSMAN readability formula
correlates well with the English ones making it a useful tool for
researchers and educators working with Arabic text.
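The evaluation above correlates Arabic and English readability scores with Spearman's correlation. A dependency-free sketch of that statistic follows, with made-up score lists; in practice one would use a library implementation such as scipy.stats.spearmanr:

```python
def ranks(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical readability scores for four parallel sentences
rho = spearman([60.2, 45.1, 72.9, 30.0], [58.0, 50.2, 70.1, 28.4])
```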
Evaluating Interactive System Adaptation
Edouard Geoffrois
Enabling users of intelligent systems to enhance the system
performance by providing feedback on their errors is an important
need. However, the ability of systems to learn from user feedback
is difficult to evaluate in an objective and comparative way.
Indeed, the involvement of real users in the adaptation process
is an impediment to objective evaluation. This issue can be
solved by using an oracle approach, where users are simulated
by oracles having access to the reference test data. Another
difficulty is to find a meaningful metric despite the fact that system
improvements depend on the feedback provided and on the system
itself. A solution is to measure the minimal amount of information
needed to correct all system errors. It can be shown that for any well-defined non-interactive task, the interactively supervised version of the task can be evaluated by combining such an oracle-based approach and a minimum supervision rate metric. This new
evaluation protocol for adaptive systems is not only expected to
drive progress for such systems, but also to pave the way for a
specialisation of actors along the value chain of their technological
development.
Complementarity, F-score, and NLP Evaluation
Leon Derczynski
This paper addresses the problem of quantifying the differences
between entity extraction systems, where in general only a small proportion of a document should be selected. Comparing overall
accuracy is not very useful in these cases, as small differences
in accuracy may correspond to huge differences in selections
over the target minority class. Conventionally, one may use per-
token complementarity to describe these differences, but it is not
very useful when the set is heavily skewed. In such situations,
which are common in information retrieval and entity recognition,
metrics like precision and recall are typically used to describe
performance. However, precision and recall fail to describe the
differences between sets of objects selected by different decision
strategies, instead just describing the proportional amount of
correct and incorrect objects selected. This paper presents a
method for measuring complementarity for precision, recall and F-score, quantifying the difference between entity extraction approaches.
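The abstract's starting points, set-based precision, recall, and F-score, plus the classic per-token complementarity it generalises, can be sketched as follows. The entity sets are invented and the paper's own extended measure is not reproduced:

```python
def prf(predicted, gold):
    """Set-based precision, recall and F1 for selected entities."""
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def complementarity(errors_a, errors_b):
    """Fraction of system B's errors that system A does not make."""
    if not errors_b:
        return 0.0
    return 1 - len(errors_a & errors_b) / len(errors_b)

gold = {"ACME", "Alice", "Paris"}
system_a = {"ACME", "Alice", "Bob"}    # hypothetical system outputs
system_b = {"ACME", "Paris", "Carol"}
errors = lambda sys: (sys - gold) | (gold - sys)  # FPs plus FNs
comp_ab = complementarity(errors(system_a), errors(system_b))
```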
DRANZIERA: An Evaluation Protocol for Multi-Domain Opinion Mining
Mauro Dragoni, Andrea Tettamanzi and Célia da Costa Pereira
Opinion Mining is a topic which has attracted a lot of interest in recent years. By observing the literature, it is often hard to replicate
system evaluation due to the unavailability of the data used for
the evaluation or to the lack of details about the protocol used in
the campaign. In this paper, we propose an evaluation protocol,
called DRANZIERA, composed of a multi-domain dataset and
guidelines allowing both to evaluate opinion mining systems in
different contexts (Closed, Semi-Open, and Open) and to compare
them to each other and to a number of baselines.
Evaluating a Topic Modelling Approach to Measuring Corpus Similarity
Richard Fothergill, Paul Cook and Timothy Baldwin
Web corpora are often constructed automatically, and their
contents are therefore often not well understood. One technique
for assessing the composition of such a web corpus is to
empirically measure its similarity to a reference corpus whose
composition is known. In this paper we evaluate a number of
measures of corpus similarity, including a method based on topic
modelling which has not been previously evaluated for this task.
To evaluate these methods we use known-similarity corpora that
have been previously used for this purpose, as well as a number of
newly-constructed known-similarity corpora targeting differences
in genre, topic, time, and region. Our findings indicate that,
overall, the topic modelling approach did not improve on a chi-
square method that had previously been found to work well for
measuring corpus similarity.
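The chi-square baseline mentioned here is, in Kilgarriff's formulation, computed over the most frequent words of the joint corpus. A rough sketch under that assumption follows, using toy token lists; lower values indicate more similar corpora:

```python
from collections import Counter

def chi_square_similarity(corpus_a, corpus_b, n=500):
    """Chi-square over the n most frequent words of the joint corpus."""
    fa, fb = Counter(corpus_a), Counter(corpus_b)
    na, nb = sum(fa.values()), sum(fb.values())
    words = [w for w, _ in (fa + fb).most_common(n)]
    chi2 = 0.0
    for w in words:
        o_a, o_b = fa[w], fb[w]
        total = o_a + o_b
        e_a = total * na / (na + nb)  # expected count in corpus A
        e_b = total * nb / (na + nb)  # expected count in corpus B
        chi2 += (o_a - e_a) ** 2 / e_a + (o_b - e_b) ** 2 / e_b
    return chi2
```

Identical corpora score exactly zero, since every observed count equals its expected count.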
User, who art thou? User Profiling for Oral Corpus Platforms
Christian Fandrych, Elena Frick, Hanna Hedeland, Anna Iliash, Daniel Jettka, Cordula Meißner, Thomas Schmidt, Franziska Wallner, Kathrin Weigert and Swantje Westpfahl
This contribution presents the background, design and results of a
study of users of three oral corpus platforms in Germany. Roughly 5,000 registered users of the Database for Spoken German (DGD),
the GeWiss corpus and the corpora of the Hamburg Centre
for Language Corpora (HZSK) were asked to participate in a
user survey. This quantitative approach was complemented by
qualitative interviews with selected users. We briefly introduce
the corpus resources involved in the study in section 2. Section
3 describes the methods employed in the user studies. Section
4 summarizes results of the studies focusing on selected key
topics. Section 5 attempts a generalization of these results to larger
contexts.
Building a Corpus of Errors and Quality in Machine Translation: Experiments on Error Impact
Angela Costa, Rui Correia and Luisa Coheur
In this paper we describe a corpus of automatic translations
annotated with both error type and quality. The 300 sentences
that we have selected were generated by Google Translate,
Systran and two in-house Machine Translation systems that
use Moses technology. The errors present on the translations
were annotated with an error taxonomy that divides errors in
five main linguistic categories (Orthography, Lexis, Grammar,
Semantics and Discourse), reflecting the language level where
the error is located. After the error annotation process, we assessed the translation quality of each sentence using a comprehension scale from 1 to 5. Both tasks of error and
quality annotation were performed by two different annotators,
achieving good levels of inter-annotator agreement. The creation
of this corpus allowed us to use it as training data for a translation
quality classifier. We drew conclusions on error severity by observing the
outputs of two machine learning classifiers: a decision tree and a
regression model.
Evaluating the Readability of Text Simplification Output for Readers with Cognitive Disabilities
Victoria Yaneva, Irina Temnikova and Ruslan Mitkov
This paper presents an approach for automatic evaluation of the
readability of text simplification output for readers with cognitive
disabilities. First, we present our work towards the development
of the EasyRead corpus, which contains easy-to-read documents
created especially for people with cognitive disabilities. We
then compare the EasyRead corpus to the simplified output
contained in the LocalNews corpus (Feng, 2009), the accessibility
of which has been evaluated through reading comprehension
experiments including 20 adults with mild intellectual disability.
This comparison is made on the basis of 13 disability-specific
linguistic features. The comparison reveals that there are no
major differences between the two corpora, which shows that the EasyRead corpus is at a similar reading level to the user-evaluated texts. We also discuss the role of Simple Wikipedia
(Zhu et al., 2010) as a widely-used accessibility benchmark, in light of our finding that it is significantly more complex than both the EasyRead and the LocalNews corpora.
Word Embedding Evaluation and Combination
Sahar Ghannay, Benoit Favre, Yannick Estève and Nathalie Camelin
Word embeddings have been successfully used in several natural
language processing tasks (NLP) and speech processing. Different
approaches have been introduced to calculate word embeddings
through neural networks. In the literature, many studies have focused on word embedding evaluation, but to our knowledge, there are still some gaps. This paper presents a study focusing on a
rigorous comparison of the performances of different kinds of
word embeddings. These performances are evaluated on different
NLP and linguistic tasks, while all the word embeddings are
estimated on the same training data using the same vocabulary,
the same number of dimensions, and other similar characteristics.
The evaluation results reported in this paper match those in the
literature, since they point out that the improvements achieved
by a word embedding in one task are not consistently observed
across all tasks. For that reason, this paper investigates and
evaluates approaches to combine word embeddings in order to
take advantage of their complementarity, and to look for the
effective word embeddings that can achieve good performances on
all tasks. As a conclusion, this paper provides new insights into the intrinsic qualities of well-known word embedding families, which can differ from the ones provided by works previously
published in the scientific literature.
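Two simple combination schemes commonly evaluated in this setting, concatenation and dimension-wise averaging, can be sketched as follows. The tiny vectors are invented; real embeddings would come from, e.g., skip-gram and GloVe models:

```python
def concatenate(emb_a, emb_b, word):
    """Stack the two vectors; dimensionality is the sum of both."""
    return emb_a[word] + emb_b[word]

def average(emb_a, emb_b, word):
    """Dimension-wise mean; requires equal dimensionality."""
    return [(x + y) / 2 for x, y in zip(emb_a[word], emb_b[word])]

# Hypothetical 2-dimensional embeddings from two different models
skipgram = {"cat": [0.5, 0.25]}
glove = {"cat": [0.25, 0.75]}
combined = concatenate(skipgram, glove, "cat")  # 4-dimensional vector
```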
Benchmarking multimedia technologies with the CAMOMILE platform: the case of Multimodal Person Discovery at MediaEval 2015
Johann Poignant, Hervé Bredin, Claude Barras, Mickael Stefas, Pierrick Bruneau and Thomas Tamisier
In this paper, we claim that the CAMOMILE collaborative
annotation platform (developed in the framework of the
eponymous CHIST-ERA project) eases the organization of
multimedia technology benchmarks, automating most of the
campaign technical workflow and enabling collaborative (hence
faster and cheaper) annotation of the evaluation data. This
is demonstrated through the successful organization of a
new multimedia task at MediaEval 2015, Multimodal Person
Discovery in Broadcast TV.
Evaluating the Impact of Light Post-Editing on Usability
Sheila Castilho and Sharon O’Brien
This paper discusses a methodology to measure the usability of
machine translated content by end users, comparing lightly post-
edited content with raw output and with the usability of source
language content. The content selected consists of Online Help
articles from a software company for a spreadsheet application,
translated from English into German. Three groups of five users each used either the source text, i.e. the English version (EN), the raw MT version (DE_MT), or the light PE version (DE_PE), and were asked to carry out six tasks. Usability was measured using
an eye tracker and cognitive, temporal and pragmatic measures of
usability. Satisfaction was measured via a post-task questionnaire
presented after the participants had completed the tasks.
P04 - Information Extraction and Retrieval (1)
Wednesday, May 25, 11:35
Chairperson: Diana Maynard Poster Session
Operational Assessment of Keyword Search on Oral History
Elizabeth Salesky, Jessica Ray and Wade Shen
This project assesses the resources necessary to make oral
history searchable by means of automatic speech recognition
(ASR). There are many inherent challenges in applying ASR
to conversational speech: smaller training set sizes and varying
demographics, among others. We assess the impact of dataset
size, word error rate and term-weighted value on human search
capability through an information retrieval task on Mechanical
Turk. We use English oral history data collected by StoryCorps, a
national organization that provides all people with the opportunity
to record, share and preserve their stories, and control for a
variety of demographics including age, gender, birthplace, and
dialect on four different training set sizes. We show comparable
search performance using a standard speech recognition system
as with hand-transcribed data, which is promising for increased
accessibility of conversational speech and oral history archives.
Odin’s Runes: A Rule Language for Information Extraction
Marco A. Valenzuela-Escárcega, Gus Hahn-Powell and Mihai Surdeanu
Odin is an information extraction framework that applies cascades
of finite state automata over both surface text and syntactic
dependency graphs. Support for syntactic patterns allows us to
concisely define relations that are otherwise difficult to express
in languages such as Common Pattern Specification Language
(CPSL), which are currently limited to shallow linguistic features.
The interaction of lexical and syntactic automata provides
robustness and flexibility when writing extraction rules. This
paper describes Odin’s declarative language for writing these
cascaded automata.
A Classification-based Approach to Economic Event Detection in Dutch News Text
Els Lefever and Véronique Hoste
Breaking news on economic events such as stock splits or
mergers and acquisitions has been shown to have a substantial
impact on the financial markets. As it is important to be
able to automatically identify events in news items accurately
and in a timely manner, we present in this paper proof-of-
concept experiments for a supervised machine learning approach
to economic event detection in newswire text. For this purpose, we
created a corpus of Dutch financial news articles in which 10 types
of company-specific economic events were annotated. We trained
classifiers using various lexical, syntactic and semantic features.
We obtain good results based on a basic set of shallow features,
thus showing that this method is a viable approach for economic
event detection in news text.
Predictive Modeling: Guessing the NLP Terms of Tomorrow
Gil Francopoulo, Joseph Mariani and Patrick Paroubek
Predictive modeling, often called “predictive analytics” in
a commercial context, encompasses a variety of statistical
techniques that analyze historical and present facts to make
predictions about unknown events. Often the unknown events
are in the future, but prediction can be applied to any type of
unknown whether it be in the past or future. In our case, we
present some experiments applying predictive modeling to the
usage of technical terms within the NLP domain.
The Gavagai Living Lexicon
Magnus Sahlgren, Amaru Cuba Gyllensten, Fredrik Espinoza, Ola Hamfors, Jussi Karlgren, Fredrik Olsson, Per Persson, Akshay Viswanathan and Anders Holst
This paper presents the Gavagai Living Lexicon, which is an
online distributional semantic model currently available in 20
different languages. We describe the underlying distributional
semantic model, and how we have solved some of the challenges
in applying such a model to large amounts of streaming data. We
also describe the architecture of our implementation, and discuss
how we deal with continuous quality assurance of the lexicon.
Arabic to English Person Name Transliteration using Twitter
Hamdy Mubarak and Ahmed Abdelali
Social media outlets are providing new opportunities for
harvesting valuable resources. We present a novel approach
for mining data from Twitter for the purpose of building
transliteration resources and systems. Such resources are crucial
in translation and retrieval tasks. We demonstrate the benefits
of the approach on Arabic to English transliteration. The contributions of this approach include the amount of data that can be collected and exploited within a limited time span; the generality of the approach, which can be adapted to other languages; and its ability to cope with new transliteration phenomena and trends. A statistical transliteration
system built using this data improved a comparable system built
from Wikipedia wikilinks data.
Korean TimeML and Korean TimeBank
Young-Seob Jeong, Won-Tae Joo, Hyun-Woo Do, Chae-Gyun Lim, Key-Sun Choi and Ho-Jin Choi
Many emerging documents usually contain temporal information.
Because the temporal information is useful for various
applications, it became important to develop a system of
extracting the temporal information from the documents. Before developing the system, it is first necessary to define or design the structure of temporal information. In other words, it is necessary
to design a language which defines how to annotate the temporal
information. There have been some studies about such annotation languages, but most of them were applicable only to a specific target language (e.g., English). Thus, it is necessary to design
an individual annotation language for each language. In this
paper, we propose a revised version of the Korean Time Mark-up Language (K-TimeML), and also introduce a dataset, named Korean TimeBank, that is constructed based on K-TimeML.
We believe that the new K-TimeML and Korean TimeBank will be used in much further research on the extraction of temporal information.
A Large DataBase of Hypernymy Relations Extracted from the Web
Julian Seitner, Christian Bizer, Kai Eckert, Stefano Faralli, Robert Meusel, Heiko Paulheim and Simone Paolo Ponzetto
Hypernymy relations (those where a hyponym term shares an “isa” relationship with its hypernym) play a key role in many Natural Language Processing (NLP) tasks, e.g. ontology learning,
automatically building or extending knowledge bases, or word
sense disambiguation and induction. In fact, such relations may
provide the basis for the construction of more complex structures
such as taxonomies, or be used as effective background knowledge
for many word understanding applications. We present a publicly
available database containing more than 400 million hypernymy
relations we extracted from the CommonCrawl web corpus. We
describe the infrastructure we developed to iterate over the web
corpus for extracting the hypernymy relations and store them
effectively into a large database. This collection of relations
represents a rich source of knowledge and may be useful for
many researchers. We offer the tuple dataset for public download
and an Application Programming Interface (API) to help other
researchers programmatically query the database.
Using a Cross-Language Information Retrieval System based on OHSUMED to Evaluate the Moses and KantanMT Statistical Machine Translation Systems
Nikolaos Katris, Richard Sutcliffe and Theodore Kalamboukis
The objective of this paper was to evaluate the performance of
two statistical machine translation (SMT) systems within a cross-
language information retrieval (CLIR) architecture and examine
if there is a correlation between translation quality and CLIR
performance. The SMT systems were KantanMT, a cloud-based
machine translation (MT) platform, and Moses, an open-source
MT application. First we trained both systems using the same
language resources: the EMEA corpus for the translation model
and language model and the QTLP corpus for tuning. Then
we translated the 63 queries of the OHSUMED test collection
from Greek into English using both MT systems. Next, we ran
the queries on the document collection using Apache Solr to
get a list of the top ten matches. The results were compared
to the OHSUMED gold standard. KantanMT achieved higher
average precision and F-measure than Moses, while both systems
produced the same recall score. We also calculated the BLEU
score for each system using the ECDC corpus. Moses achieved
a higher BLEU score than KantanMT. Finally, we also tested the
IR performance of the original English queries. Overall, this work showed that CLIR performance can be better even when the BLEU score is worse.
Two Decades of Terminology: European Framework Programmes Titles
Gabriella Pardelli, Sara Goggi, Silvia Giannini and Stefania Biagioni
This work analyses a corpus made of the titles of research projects
belonging to the last four European Commission Framework
Programmes (FP4, FP5, FP6, FP7) during a time span of
nearly two decades (1994-2012). The starting point is the
idea of creating a corpus of titles which would constitute a
terminological niche, a sort of “cluster map” offering an overall
vision on the terms used and the links between them. Moreover,
by performing a terminological comparison over a period of
time it is possible to trace the presence of obsolete words
in outdated research areas as well as of neologisms in the
most recent fields. Within this scenario, the minimal purpose
is to build a corpus of titles of European projects belonging
to the several Framework Programmes in order to obtain a
terminological mapping of relevant words in the various research
areas: particularly significant would be those terms spread across
different domains or those extremely tied to a specific domain.
A term could actually be found in many fields, and being able to acknowledge and retrieve this cross-presence means being able to link those different domains by means of a process of terminological mapping.
Legal Text Interpretation: Identifying Hohfeldian Relations from Text
Wim Peters and Adam Wyner
The paper investigates the extent to which semi-automatic analysis can support the specific task of assigning Hohfeldian relations of Duty, using the General Architecture for Text
Engineering tool for the automated extraction of Duty instances
and the bearers of associated roles. The outcome of the analysis
supports scholars in identifying Hohfeldian structures in legal
text when performing close reading of the texts. A cyclic
workflow involving automated annotation and expert feedback
will incrementally increase the quality and coverage of the
automatic extraction process, and increasingly reduce the amount
of manual work required of the scholar.
Analysis of English Spelling Errors in a Word-Typing Game
Ryuichi Tachibana and Mamoru Komachi
The emergence of the web has created the need to detect and correct noisy consumer-generated texts. Most of the previous
studies on English spelling-error extraction collected English
spelling errors from web services such as Twitter by using the edit
distance or from input logs utilizing crowdsourcing. However, in
the former approach, it is not clear which word corresponds to the
spelling error, and the latter approach requires an annotation cost
for the crowdsourcing. One notable exception is Rodrigues and
Rytting (2012), who proposed to extract English spelling errors
by using a word-typing game. Their approach saves the cost
of crowdsourcing, and guarantees an exact alignment between
the word and the spelling error. However, they did not verify whether the extracted spelling-error corpora reflect the usual writing process, such as writing a document. Therefore, we
propose a new correctable word-typing game that is more similar
to the actual writing process. Experimental results showed that we
can regard typing-game logs as a source of spelling errors.
Finding Definitions in Large Corpora with Sketch Engine
Vojtech Kovár, Monika Mociariková and Pavel Rychlý
The paper describes automatic definition finding implemented
within the leading corpus query and management tool, Sketch
Engine. The implementation exploits complex pattern-matching
queries in the corpus query language (CQL) and the indexing
mechanism of word sketches for finding and storing definition
candidates throughout the corpus. The approach is evaluated for
Czech and English corpora, showing that the results are usable
in practice: precision of the tool ranges between 30 and 75
percent (depending on the major corpus text types) and we were
able to extract nearly 2 million definition candidates from an
English corpus with 1.4 billion words. The feature is embedded
into the interface as a concordance filter, so that users can
search for definitions of any query to the corpus, including very
specific multi-word queries. The results also indicate that ordinary texts (unlike explanatory texts) contain a rather low number of definitions, which is perhaps the most important problem with automatic definition finding in general.
Improving Information Extraction from Wikipedia Texts using Basic English
Teresa Rodriguez-Ferreira, Adrian Rabadan, Raquel Hervas and Alberto Diaz
The aim of this paper is to study the effect that the use of Basic
English versus common English has on information extraction
from online resources. The amount of online information
available to the public grows exponentially, and is potentially
an excellent resource for information extraction. The problem
is that this information often comes in an unstructured format,
such as plain text. In order to retrieve knowledge from this type
of text, it must first be analysed to find the relevant details, and
the nature of the language used can greatly impact the quality
of the extracted information. In this paper, we compare triplets
that represent definitions or properties of concepts obtained from
three online collaborative resources (English Wikipedia, Simple
English Wikipedia and Simple English Wiktionary) and study the
differences in the results when Basic English is used instead of
common English. The results show that resources written in Basic English produce fewer triplets, but of higher quality.
NLP and Public Engagement: The Case of the Italian School Reform
Tommaso Caselli, Giovanni Moretti, Rachele Sprugnoli, Sara Tonelli, Damien Lanfrey and Donatella Solda Kutzmann
In this paper we present PIERINO (PIattaforma per l’Estrazione
e il Recupero di INformazione Online), a system that was
implemented in collaboration with the Italian Ministry of
Education, University and Research to analyse the citizens’
comments given in the #labuonascuola survey. The platform
includes various levels of automatic analysis such as key-concept
extraction and word co-occurrences. Each analysis is displayed
through an intuitive view using different types of visualizations,
for example radar charts and sunburst. PIERINO was effectively
used to support shaping the last Italian school reform, proving the
potential of NLP in the context of policy making.
Evaluating Translation Quality and CLIR Performance of Query Sessions
Xabier Saralegi, Eneko Agirre and Iñaki Alegria
This paper presents the evaluation of the translation quality
and Cross-Lingual Information Retrieval (CLIR) performance
when using session information as the context of queries. The
hypothesis is that previous queries provide context that helps to
solve ambiguous translations in the current query. We tested
several strategies on the TREC 2010 Session track dataset,
which includes query reformulations grouped by generalization,
specification, and drifting types. We study the Basque to
English direction, evaluating both the translation quality and CLIR
performance, with positive results in both cases. The results show
that the quality of translation improved, reducing error rate by
12% (HTER) when using session information, which improved
CLIR results 5% (nDCG). We also provide an analysis of the
improvements across the three kinds of sessions: generalization,
specification, and drifting. Translation quality improved in all
three types (generalization, specification, and drifting), and CLIR
improved for generalization and specification sessions, preserving
the performance in drifting sessions.
Construction and Analysis of a Large Vietnamese Text Corpus
Dieu-Thu Le and Uwe Quasthoff
This paper presents a new Vietnamese text corpus which contains
around 4.05 billion words. It is a collection of Wikipedia texts,
newspaper articles and random web texts. The paper describes the
process of collecting, cleaning and creating the corpus. Processing Vietnamese texts poses several challenges: for example, unlike many languages written in the Latin script, Vietnamese does not use blanks to separate words, so common tokenization approaches such as treating blanks as word boundaries do not work. A short
review about different approaches of Vietnamese tokenization
is presented together with how the corpus has been processed
and created. After that, some statistical analyses of the data are reported, including the number of syllables, average word length, sentence length, and topic analysis. The corpus is integrated into a
framework which allows searching and browsing. Using this web
interface, users can find out how many times a particular word appears in the corpus, see sample sentences in which it occurs, and inspect its left and right neighbors.
Forecasting Emerging Trends from Scientific Literature
Kartik Asooja, Georgeta Bordea, Gabriela Vulcu and Paul Buitelaar
Text analysis methods for the automatic identification of emerging technologies from scientific publications are gaining attention because of their socio-economic impact.
The approaches so far have been mainly focused on retrospective
analysis by mapping scientific topic evolution over time. We
propose regression based approaches to predict future keyword
distribution. The prediction is based on historical data of the
keywords, which in our case, are LREC conference proceedings.
Considering the insufficient number of data points available from
LREC proceedings, we do not employ standard time series
forecasting methods. We form a dataset by extracting the
keywords from previous year proceedings and quantify their
yearly relevance using tf-idf scores. This dataset additionally
contains ranked lists of related keywords and experts for each
keyword.
Extracting Structured Scholarly Information from the Machine Translation Literature
Eunsol Choi, Matic Horvat, Jonathan May, Kevin Knight and Daniel Marcu
Understanding the experimental results of a scientific paper
is crucial to understanding its contribution and to comparing
it with related work. We introduce a structured, queryable
representation for experimental results and a baseline system that
automatically populates this representation. The representation
can answer compositional questions such as: “Which are the
best published results reported on the NIST 09 Chinese to
English dataset?” and “What are the most important methods
for speeding up phrase-based decoding?” Answering such
questions usually involves lengthy literature surveys. Current
machine reading for academic papers does not usually consider
the actual experiments, but mostly focuses on understanding
abstracts. We describe annotation work to create an initial ⟨scientific paper; experimental results representation⟩ corpus. The
corpus is composed of 67 papers which were manually annotated
with a structured representation of experimental results by domain
experts. Additionally, we present a baseline algorithm that
characterizes the difficulty of the inference task.
Staggered NLP-assisted refinement for Clinical Annotations of Chronic Disease Events
Stephen Wu, Chung-Il Wi, Sunghwan Sohn, Hongfang Liu and Young Juhn
Domain-specific annotations for NLP are often centered on real-
world applications of text, and incorrect annotations may be
particularly unacceptable. In medical text, the process of manual
chart review (of a patient’s medical record) is error-prone due to
its complexity. We propose a staggered NLP-assisted approach to
the refinement of clinical annotations, an interactive process that
allows initial human judgments to be verified or falsified by means
of comparison with an improving NLP system. We show on our
internal Asthma Timelines dataset that this approach improves the
quality of the human-produced clinical annotations.
“Who was Pietro Badoglio?” Towards a QA system for Italian History
Stefano Menini, Rachele Sprugnoli and Antonio Uva
This paper presents QUANDHO (QUestion ANswering Data for
italian HistOry), an Italian question answering dataset created
to cover a specific domain, i.e. the history of Italy in the first
half of the XX century. The dataset includes questions manually
classified and annotated with Lexical Answer Types, and a set of
question-answer pairs. This resource, freely available for research
purposes, has been used to retrain a domain-independent question answering system so as to improve its performance in the domain of
interest. Ongoing experiments on the development of a question
classifier and an automatic tagger of Lexical Answer Types are
also presented.
O5 - LR Infrastructures and Architectures
Wednesday, May 25, 14:45
Chairperson: Franciska de Jong Oral Session
A Document Repository for Social Media and Speech Conversations
Adam Funk, Robert Gaizauskas and Benoit Favre
We present a successfully implemented document repository
REST service for flexible SCRUD (search, create, read, update, delete) storage of social media conversations, using a
GATE/TIPSTER-like document object model and providing a
query language for document features. This software is currently
being used in the SENSEI research project and will be published
as open-source software before the project ends. It is, to the best
of our knowledge, the first freely available, general purpose data
repository to support large-scale multimodal (i.e., speech or text)
conversation analytics.
Towards a Linguistic Ontology with an Emphasis on Reasoning and Knowledge Reuse
Artemis Parvizi, Matt Kohl, Meritxell Gonzàlez and Roser Saurí
The Dictionaries division at Oxford University Press (OUP)
is aiming to model, integrate, and publish lexical content
for 100 languages focussing on digitally under-represented
languages. While there are multiple ontologies designed for
linguistic resources, none had adequate features for meeting our
requirements, chief of which was the capability to losslessly
capture diverse features of many different languages in a
dictionary format, while supplying a framework for inferring
relations like translation, derivation, etc., between the data.
Building on valuable features of existing models, and working
with OUP monolingual and bilingual dictionary datasets, we
have designed and implemented a new linguistic ontology. The
ontology has been reviewed by a number of computational
linguists, and we are working to move more dictionary data into
it. We have also developed APIs to surface the linked data to
dictionary websites.
Providing a Catalogue of Language Resources for Commercial Users
Bente Maegaard, Lina Henriksen, Andrew Joscelyne, Vesna Lusicky, Margaretha Mazura, Sussi Olsen, Claus Povlsen and Philippe Wacker
Language resources (LR) are indispensable for the development of
tools for machine translation (MT) or various kinds of computer-
assisted translation (CAT). In particular, language corpora, both parallel and monolingual, are considered most important, for instance for MT, not only SMT but also hybrid MT. The Language
Technology Observatory will provide easy access to information
about LRs deemed to be useful for MT and other translation tools
through its LR Catalogue. In order to determine what aspects of
an LR are useful for MT practitioners, a user study was made,
providing a guide to the most relevant metadata and the most
relevant quality criteria. We have seen that many resources exist
which are useful for MT and similar work, but the majority are
for (academic) research or educational use only, and as such not
available for commercial use. Our work has revealed a list of gaps:
coverage gap, awareness gap, quality gap, quantity gap. The paper
ends with recommendations for a forward-looking strategy.
The Language Application Grid and Galaxy
Nancy Ide, Keith Suderman, James Pustejovsky, Marc Verhagen and Christopher Cieri
The NSF-SI2-funded LAPPS Grid project is a collaborative
effort among Brandeis University, Vassar College, Carnegie-
Mellon University (CMU), and the Linguistic Data Consortium
(LDC), which has developed an open, web-based infrastructure
through which resources can be easily accessed and within
which tailored language services can be efficiently composed,
evaluated, disseminated and consumed by researchers, developers,
and students across a wide variety of disciplines. The LAPPS
Grid project recently adopted Galaxy (Giardine et al., 2005),
a robust, well-developed, and well-supported front end for
workflow configuration, management, and persistence. Galaxy
allows data inputs and processing steps to be selected from
graphical menus, and results are displayed in intuitive plots
and summaries that encourage interactive workflows and the
exploration of hypotheses. The Galaxy workflow engine provides
significant advantages for deploying pipelines of LAPPS Grid web
services, including not only the means to create and deploy locally-run and even customized versions of the LAPPS Grid, or to run it in the cloud, but also access to a huge
array of statistical and visualization tools that have been developed
for use in genomics research.
ELRA Activities and Services
Khalid Choukri, Valérie Mapelli, Hélène Mazo and Vladimir Popescu
After celebrating its 20th anniversary in 2015, ELRA is carrying
on its strong involvement in the HLT field. To share ELRA’s expertise from those past 21 years, this article begins with a
presentation of ELRA’s strategic Data and LR Management Plan
for a wide use by the language communities. Then, we further
report on ELRA’s activities and services provided since LREC
2014. When looking at the cataloguing and licensing activities,
we can see that ELRA has been active in moving the Meta-Share repository toward new development steps, supporting Europe in obtaining accurate LRs within the Connecting Europe Facility programme, promoting the use of LR citation, and creating the ELRA License Wizard web portal. The article further elaborates on
the recent LR production activities of various written, speech and
video resources, commissioned by public and private customers.
In parallel, ELDA has also worked on several EU-funded projects
centred on strategic issues related to the European Digital Single
Market. The last part gives an overview of the latest dissemination
activities, with a special focus on the celebration of its 20th anniversary organised in Dubrovnik (Croatia), the follow-up of LREC, and the launch of the new ELRA portal.
O6 - Multimodality
Wednesday, May 25, 14:45
Chairperson: Kristiina Jokinen Oral Session
Mirroring Facial Expressions and Emotions in Dyadic Conversations
Costanza Navarretta
This paper presents an investigation of mirroring facial
expressions and the emotions which they convey in dyadic
naturally occurring first encounters. Mirroring facial expressions
are a common phenomenon in face-to-face interactions, and they
are due to the mirror neuron system which has been found in
both animals and humans. Researchers have proposed that the
mirror neuron system is an important component behind many
cognitive processes such as action learning and understanding the
emotions of others. Preceding studies of the first encounters have
shown that overlapping speech and overlapping facial expressions
are very frequent. In this study, we want to determine whether
the overlapping facial expressions are mirrored or are otherwise
correlated in the encounters, and to what extent mirroring facial
expressions convey the same emotion. The results of our study
show that the majority of smiles and laughs, and one fifth of the
occurrences of raised eyebrows are mirrored in the data. Moreover, some facial traits in co-occurring expressions co-occur more often than would be expected by chance. Finally, amusement, and
to a lesser extent friendliness, are often emotions shared by both
participants, while other emotions indicating individual affective
states such as uncertainty and hesitancy are never shown by both
participants, but co-occur with complementary emotions such as
friendliness and support. Whether these tendencies are specific
to this type of conversations or are more common should be
investigated further.
Humor in Collective Discourse: Unsupervised Funniness Detection in the New Yorker Cartoon Caption Contest
Dragomir Radev, Amanda Stent, Joel Tetreault, Aasish Pappu, Aikaterini Iliakopoulou, Agustin Chanfreau, Paloma de Juan, Jordi Vallmitjana, Alejandro Jaimes, Rahul Jha and Robert Mankoff
The New Yorker publishes a weekly captionless cartoon. More
than 5,000 readers submit captions for it. The editors select three
of them and ask the readers to pick the funniest one. We describe
an experiment that compares a dozen automatic methods for
selecting the funniest caption. We show that negative sentiment,
human-centeredness, and lexical centrality most strongly match
the funniest captions, followed by positive sentiment. These
results are useful for understanding humor and also in the design
of more engaging conversational agents in text and multimodal
(vision+text) systems. As part of this work, a large set of cartoons
and captions is being made available to the community.
A Corpus of Text Data and Gaze Fixations from Autistic and Non-Autistic Adults
Victoria Yaneva, Irina Temnikova and Ruslan Mitkov
The paper presents a corpus of text data and its corresponding
gaze fixations obtained from autistic and non-autistic readers.
The data was elicited through reading comprehension testing
combined with eye-tracking recording. The corpus consists of
1034 content words tagged with their POS, syntactic role and three
gaze-based measures corresponding to the autistic and control
participants. The reading skills of the participants were measured
through multiple-choice questions and, based on the answers
given, they were divided into groups of skillful and less-skillful
readers. This division of the groups informs researchers on
whether particular fixations were elicited from skillful or less-
skillful readers and allows a fair between-group comparison for
two levels of reading ability. In addition to describing the process
of data collection and corpus development, we present a study on
the effect that word length has on reading in autism. The corpus is
intended as a resource for investigating the particular linguistic
constructions which pose reading difficulties for people with
autism and hopefully, as a way to inform future text simplification
research intended for this population.
A Multimodal Corpus for the Assessment of Public Speaking Ability and Anxiety
Mathieu Chollet, Torsten Wörtwein, Louis-Philippe Morency and Stefan Scherer
The ability to efficiently speak in public is an essential asset for
many professions and is used in everyday life. As such, tools
enabling the improvement of public speaking performance and the
assessment and mitigation of anxiety related to public speaking
would be very useful. Multimodal interaction technologies,
such as computer vision and embodied conversational agents,
have recently been investigated for the training and assessment
of interpersonal skills. One central requirement for these technologies is the availability of multimodal corpora for training machine learning models. This paper addresses that need by presenting and sharing a multimodal corpus of public
speaking presentations. These presentations were collected in
an experimental study investigating the potential of interactive
virtual audiences for public speaking training. This corpus
includes audio-visual data and automatically extracted features,
measures of public speaking anxiety and personality, annotations
of participants’ behaviors and expert ratings of behavioral aspects
and overall performance of the presenters. We hope this corpus
will help other research teams in developing tools for supporting
public speaking training.
Deep Learning of Audio and Language Features for Humor Prediction
Dario Bertero and Pascale Fung
We propose a comparison between various supervised machine
learning methods to predict and detect humor in dialogues.
We retrieve our humorous dialogues from a very popular TV
sitcom: “The Big Bang Theory”. We build a corpus where
punchlines are annotated using the canned laughter embedded in
the audio track. Our comparative study involves a linear-chain
Conditional Random Field over a Recurrent Neural Network and
a Convolutional Neural Network. Using a combination of word-
level and audio frame-level features, the CNN outperforms the
other methods, obtaining the best F-score of 68.5% over 66.5%
by CRF and 52.9% by RNN. Our work is a starting point to
developing more effective machine learning and neural network
models on the humor prediction task, as well as developing
machines capable in understanding humor in general.
O7 - Multiword Expressions
Wednesday, May 25, 14:45
Chairperson: Aline Villavicencio
Oral Session
An Empirical Study of Arabic Formulaic Sequence Extraction Methods
Ayman Alghamdi, Eric Atwell and Claire Brierley
This paper aims to implement what is referred to as the
collocation of the Arabic keywords approach for extracting
formulaic sequences (FSs) in the form of high frequency but
semantically regular formulas that are not restricted to any
syntactic construction or semantic domain. The study applies
several distributional semantic models in order to automatically
extract relevant FSs related to Arabic keywords. The data sets
used in this experiment are drawn from a newly developed corpus-based Arabic wordlist consisting of 5,189 lexical items which represent a variety of Modern Standard Arabic (MSA) genres and regions; the wordlist is based on overlapping frequencies derived from a comprehensive comparison of four large Arabic corpora with a total size of over 8 billion running
words. Empirical n-best precision evaluation methods are used
to determine the best association measures (AMs) for extracting
high frequency and meaningful FSs. The gold standard reference
FSs list was developed in previous studies and manually evaluated
against well-established quantitative and qualitative criteria. The
results demonstrate that the MI.log_f AM achieved the highest
results in extracting significant FSs from the large MSA corpus,
while the T-score association measure achieved the worst results.
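The association measures compared above can be illustrated with toy counts. The sketch below uses the textbook t-score and a PMI-times-log-frequency reading of MI.log_f; the frequencies are invented and the exact formulation used in the paper may differ.

```python
import math

def t_score(pair_freq, w1_freq, w2_freq, corpus_size):
    # t-score: (observed - expected) / sqrt(observed)
    expected = w1_freq * w2_freq / corpus_size
    return (pair_freq - expected) / math.sqrt(pair_freq)

def mi_log_f(pair_freq, w1_freq, w2_freq, corpus_size):
    # Pointwise mutual information weighted by log of the pair frequency
    pmi = math.log2(pair_freq * corpus_size / (w1_freq * w2_freq))
    return pmi * math.log(pair_freq)

# Invented counts for one candidate pair in an 8-billion-word corpus
print(round(t_score(500, 10000, 2000, 8_000_000_000), 2))   # -> 22.36
print(round(mi_log_f(500, 10000, 2000, 8_000_000_000), 2))  # -> 109.44
```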
Rule-based Automatic Multi-word Term Extraction and Lemmatization
Ranka Stankovic, Cvetana Krstev, Ivan Obradovic, Biljana Lazic and Aleksandra Trtovac
In this paper we present a rule-based method for multi-word
term extraction that relies on extensive lexical resources in the
form of electronic dictionaries and finite-state transducers for
modelling various syntactic structures of multi-word terms. The
same technology is used for lemmatization of extracted multi-
word terms, which is unavoidable for highly inflected languages
in order to pass extracted data to evaluators and subsequently
to terminological e-dictionaries and databases. The approach is
illustrated on a corpus of Serbian texts from the mining domain
containing more than 600,000 simple word forms. Extracted and
lemmatized multi-word terms are filtered in order to reject falsely
offered lemmas and then ranked by introducing measures that
combine linguistic and statistical information (C-Value, T-Score,
LLR, and Keyness). Mean average precision for retrieval of MWU
forms ranges from 0.789 to 0.804, while mean average precision
of lemma production ranges from 0.956 to 0.960. The evaluation
showed that 94% of distinct multi-word forms were evaluated as
proper multi-word units, and among them 97% were associated
with correct lemmas.
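Among the ranking measures mentioned, C-value has a simple closed form: the log of the term length times its frequency, discounted by the average frequency of longer candidates that contain it. A minimal sketch with invented counts, not the paper's Serbian mining corpus:

```python
import math

def c_value(term, freq, nested_freqs):
    # C-value(a) = log2|a| * f(a)                 if a is not nested
    #            = log2|a| * (f(a) - mean f(b))   otherwise,
    # where b ranges over longer candidates containing a.
    length = len(term.split())
    if not nested_freqs:
        return math.log2(length) * freq
    return math.log2(length) * (freq - sum(nested_freqs) / len(nested_freqs))

# A two-word candidate occurring 120 times, 30 of them inside
# one longer candidate term (invented counts)
print(c_value("rudarska oprema", 120, [30]))  # -> 90.0
```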
Distribution of Valency Complements in Czech Complex Predicates: Between Verb and Noun
Václava Kettnerová and Eduard Bejcek
In this paper, we focus on Czech complex predicates formed
by a light verb and a predicative noun expressed as the
direct object. Although Czech – as an inflectional language
encoding syntactic relations via morphological cases – provides
an excellent opportunity to study the distribution of valency
complements in the syntactic structure with complex predicates,
this distribution has not been described so far. On the basis of
a manual analysis of the richly annotated data from the Prague
Dependency Treebank, we thus formulate principles governing
this distribution. In an automatic experiment, we verify these
principles on well-formed syntactic structures from the Prague
Dependency Treebank and the Prague Czech-English Dependency
Treebank with very satisfactory results: the distribution of 97%
of valency complements in the surface structure is governed by
the proposed principles. These results corroborate that the surface
structure formation of complex predicates is a regular process.
A Lexical Resource of Hebrew Verb-Noun Multi-Word Expressions
Chaya Liebeskind and Yaakov HaCohen-Kerner
A verb-noun Multi-Word Expression (MWE) is a combination
of a verb and a noun with or without other words, in which
the combination has a meaning different from the meaning of
the words considered separately. In this paper, we present
a new lexical resource of Hebrew Verb-Noun MWEs (VN-
MWEs). The VN-MWEs of this resource were manually collected
and annotated from five different web resources. In addition,
we analyze the lexical properties of Hebrew VN-MWEs by
classifying them to three types: morphological, syntactic, and
semantic. These two contributions are essential for designing
algorithms for automatic VN-MWEs extraction. The analysis
suggests some interesting features of VN-MWEs for exploration.
The lexical resource enables sampling a set of positive examples
for Hebrew VN-MWEs. This set of examples can either be used
for training supervised algorithms or as seeds in unsupervised
bootstrapping algorithms. Thus, this resource is a first step
towards automatic identification of Hebrew VN-MWEs, which
is important for natural language understanding, generation and
translation systems.
Cross-lingual Linking of Multi-word Entities and their corresponding Acronyms
Guillaume Jacquet, Maud Ehrmann, Ralf Steinberger and Jaakko Väyrynen
This paper reports on an approach and experiments to
automatically build a cross-lingual multi-word entity resource.
Starting from a collection of millions of acronym/expansion pairs
for 22 languages where expansion variants were grouped into
monolingual clusters, we experiment with several aggregation
strategies to link these clusters across languages. Aggregation
strategies make use of string similarity distances and translation
probabilities and they are based on vector space and graph
representations. The accuracy of the approach is evaluated against
Wikipedia’s redirection and cross-lingual linking tables. The
resulting multi-word entity resource contains 64,000 multi-word
entities with unique identifiers and their 600,000 multilingual
lexical variants. We intend to make this new resource publicly
available.
O8 - Named Entity Recognition
Wednesday, May 25, 14:45
Chairperson: Yuji Matsumoto
Oral Session
SemLinker, a Modular and Open Source Framework for Named Entity Discovery and Linking
Marie-Jean Meurs, Hayda Almeida, Ludovic Jean-Louis and Eric Charton
This paper presents SemLinker, an open source system that
discovers named entities, connects them to a reference knowledge
base, and clusters them semantically. SemLinker relies on several modules that perform surface form generation, mutual disambiguation, and entity clustering, and makes use of two annotation engines. SemLinker was evaluated in the English Entity
Discovery and Linking track of the Text Analysis Conference
on Knowledge Base Population, organized by the US National
Institute of Standards and Technology. Along with the SemLinker
source code, we release our annotation files containing the
discovered named entities, their types, and position across
processed documents.
Context-enhanced Adaptive Entity Linking
Filip Ilievski, Giuseppe Rizzo, Marieke van Erp, Julien Plu and Raphael Troncy
More and more knowledge bases are publicly available as linked
data. Since these knowledge bases contain structured descriptions
of real-world entities, they can be exploited by entity linking
systems that anchor entity mentions from text to the most
relevant resources describing those entities. In this paper, we
investigate adaptation of the entity linking task using contextual
knowledge. The key intuition is that entity linking can be
customized depending on the textual content, as well as on the
application that would make use of the extracted information. We
present an adaptive approach that relies on contextual knowledge
from text to enhance the performance of ADEL, a hybrid linguistic
and graph-based entity linking system. We evaluate our approach
on a domain-specific corpus consisting of annotated WikiNews
articles.
Named Entity Recognition on Twitter for Turkish using Semi-supervised Learning with Word Embeddings
Eda Okur, Hakan Demir and Arzucan Özgür
Recently, due to the increasing popularity of social media, the
necessity for extracting information from informal text types,
such as microblog texts, has gained significant attention. In
this study, we focused on the Named Entity Recognition (NER)
problem on informal text types for Turkish. We utilized a
semi-supervised learning approach based on neural networks.
We applied a fast unsupervised method for learning continuous
representations of words in vector space. We made use of these
obtained word embeddings, together with language independent
features that are engineered to work better on informal text types,
for generating a Turkish NER system on microblog texts. We
evaluated our Turkish NER system on Twitter messages and
achieved better F-score performances than the published results
of previously proposed NER systems on Turkish tweets. Since
we did not employ any language dependent features, we believe
that our method can be easily adapted to microblog texts in other
morphologically rich languages.
Entity Linking with a Paraphrase Flavor
Maria Pershina, Yifan He and Ralph Grishman
The task of Named Entity Linking is to link entity mentions
in the document to their correct entries in a knowledge base
and to cluster NIL mentions. Ambiguous, misspelled, and
incomplete entity mention names are the main challenges in the
linking process. We propose a novel approach that combines
two state-of-the-art models — for entity disambiguation and
for paraphrase detection — to overcome these challenges. We
consider name variations as paraphrases of the same entity
mention and adopt a paraphrase model for this task. Our
approach utilizes a graph-based disambiguation model based on
Personalized Page Rank, and then refines and clusters its output
using the paraphrase similarity between entity mention strings. It
achieves a competitive performance of 80.5% in B3+F clustering
score on diagnostic TAC EDL 2014 data.
Domain Adaptation for Named Entity Recognition Using CRFs
Tian Tian, Marco Dinarelli, Isabelle Tellier and Pedro Dias Cardoso
In this paper we explain how we created a labelled corpus in
English for a Named Entity Recognition (NER) task from multi-
source and multi-domain data, for an industrial partner. We
explain the specificities of this corpus with examples and describe
some baseline experiments. We present some results of domain
adaptation on this corpus using a labelled Twitter corpus (Ritter
et al., 2011). We tested a semi-supervised method from (Garcia-
Fernandez et al., 2014) combined with a supervised domain
adaptation approach proposed in (Raymond and Fayolle, 2010) for
machine learning experiments with CRFs (Conditional Random
Fields). We use the same technique to improve the NER results
on the Twitter corpus (Ritter et al., 2011). Our contributions
thus consist in an industrial corpus creation and NER performance
improvements.
P05 - Machine Translation (1)
Wednesday, May 25, 14:45
Chairperson: Martin Volk
Poster Session
IRIS: English-Irish Machine Translation System
Mihael Arcan, Caoilfhionn Lane, Eoin Ó Droighneáin and Paul Buitelaar
We describe IRIS, a statistical machine translation (SMT) system
for translating from English into Irish and vice versa. Since Irish
is considered an under-resourced language with a limited amount
of machine-readable text, building a machine translation system
that produces reasonable translations is rather challenging. As
translation is a difficult task, current research in SMT focuses on obtaining statistics from large amounts of parallel, monolingual, or other multilingual resources. Nevertheless, we
collected available English-Irish data and developed an SMT
system aimed at supporting human translators and enabling cross-
lingual language technology tasks.
Linguistically Inspired Language Model Augmentation for MT
George Tambouratzis and Vasiliki Pouli
The present article reports on efforts to improve the translation
accuracy of a corpus–based Machine Translation (MT) system.
In order to achieve that, an error analysis performed on past
translation outputs has indicated the likelihood of improving the
translation accuracy by augmenting the coverage of the Target-
Language (TL) side language model. The method adopted for
improving the language model is initially presented, based on the
concatenation of consecutive phrases. The algorithmic steps that form the augmentation process are then described. The key idea is to augment the language model only to
cover the most frequent cases of phrase sequences, as counted
over a TL-side corpus, in order to maximize the cases covered
by the new language model entries. Experiments presented in the
article show that substantial improvements in translation accuracy
are achieved via the proposed method when the grown language model is integrated into the corpus-based MT system.
A Rule-based Shallow-transfer Machine Translation System for Scots and English
Gavin Abercrombie
An open-source rule-based machine translation system is
developed for Scots, a low-resourced minor language closely
related to English and spoken in Scotland and Ireland. By
concentrating on translation for assimilation (gist comprehension)
from Scots to English, it is proposed that the development of
dictionaries designed to be used within the Apertium platform will be sufficient to produce translations that improve non-Scots speakers' understanding of the language. Mono- and bilingual
Scots dictionaries are constructed using lexical items gathered
from a variety of resources across several domains. Although
the primary goal of this project is translation for gisting, the
system is evaluated for both assimilation and dissemination
(publication-ready translations). A variety of evaluation methods
are used, including a cloze test undertaken by human volunteers.
While evaluation results are comparable to, and in some cases
superior to, those of other language pairs within the Apertium
platform, room for improvement is identified in several areas of
the system.
Syntax-based Multi-system Machine Translation
Matiss Rikters and Inguna Skadina
This paper describes a hybrid machine translation system that
explores a parser to acquire syntactic chunks of a source sentence,
translates the chunks with multiple online machine translation
(MT) system application program interfaces (APIs) and creates
output by combining translated chunks to obtain the best possible
translation. The selection of the best translation hypothesis
is performed by calculating the perplexity for each translated
chunk. The goal of this approach is to enhance the baseline
multi-system hybrid translation (MHyT) system that uses only
a language model to select the best translation from translations
obtained with different APIs and to improve overall English –
Latvian machine translation quality over each of the individual
MT APIs. The presented syntax-based multi-system translation
(SyMHyT) system demonstrates an improvement in terms of
BLEU and NIST scores compared to the baseline system.
Improvements reach from 1.74 up to 2.54 BLEU points.
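The chunk-selection step can be sketched as choosing, per chunk, the candidate translation with the lowest language-model perplexity. The toy unigram model and candidate strings below are invented placeholders, not the authors' implementation:

```python
import math

# Toy unigram "language model" (invented probabilities)
LM = {"the": 0.05, "cat": 0.01, "sat": 0.008, "kaķis": 1e-6, "sēž": 1e-6}

def perplexity(tokens, lm, floor=1e-8):
    # PPL = exp(-1/N * sum log p(w)); unseen words get a floor probability
    logp = sum(math.log(lm.get(t, floor)) for t in tokens)
    return math.exp(-logp / len(tokens))

def best_chunk(candidates, lm):
    # Keep the translation hypothesis with the lowest perplexity
    return min(candidates, key=lambda c: perplexity(c.split(), lm))

print(best_chunk(["the cat sat", "kaķis sēž"], LM))  # -> the cat sat
```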
Use of Domain-Specific Language Resources in Machine Translation
Sanja Štajner, Andreia Querido, Nuno Rendeiro, João António Rodrigues and António Branco
In this paper, we address the problem of Machine Translation
(MT) for a specialised domain in a language pair for which only
a very small domain-specific parallel corpus is available. We
conduct a series of experiments using a purely phrase-based SMT
(PBSMT) system and a hybrid MT system (TectoMT), testing
three different strategies to overcome the problem of the small
amount of in-domain training data. Our results show that adding
a small size in-domain bilingual terminology to the small in-
domain training corpus leads to the best improvements of a hybrid
MT system, while the PBSMT system achieves the best results
by adding a combination of in-domain bilingual terminology
and a larger out-of-domain corpus. We focus on qualitative human evaluation of the output of the two best systems (one for
each approach) and perform a systematic in-depth error analysis
which revealed advantages of the hybrid MT system over the pure
PBSMT system for this specific task.
CATaLog Online: Porting a Post-editing Tool to the Web
Santanu Pal, Marcos Zampieri, Sudip Kumar Naskar, Tapas Nayak, Mihaela Vela and Josef van Genabith
This paper presents CATaLog online, a new web-based MT and
TM post-editing tool. CATaLog online is freeware that can be used through a web browser and requires only a simple registration. The tool features a number of editing
and log functions similar to the desktop version of CATaLog
enhanced with several new features that we describe in detail in
this paper. CATaLog online is designed to allow users to post-edit both translation memory segments and machine translation
output. The tool provides a complete set of log information
currently not available in most commercial CAT tools. Log
information can be used both for project management purposes
as well as for the study of the translation process and translator’s
productivity.
The ILMT-s2s Corpus – A Multimodal Interlingual Map Task Corpus
Akira Hayakawa, Saturnino Luz, Loredana Cerrato and Nick Campbell
This paper presents the multimodal Interlingual Map Task Corpus
(ILMT-s2s corpus) collected at Trinity College Dublin, and
discusses some of the issues related to the collection and analysis
of the data. The corpus design is inspired by the HCRC Map Task
Corpus which was initially designed to support the investigation
of linguistic phenomena, and has been the focus of a variety of
studies of communicative behaviour. The simplicity of the task,
and the complexity of phenomena it can elicit, make the map
task an ideal object of study. Although there are studies that
used replications of the map task to investigate communication
in computer mediated tasks, this ILMT-s2s corpus is, to the
best of our knowledge, the first investigation of communicative
behaviour in the presence of three additional “filters”: Automatic
Speech Recognition (ASR), Machine Translation (MT) and Text
To Speech (TTS) synthesis, where the instruction giver and the
instruction follower speak different languages. This paper details
the data collection setup and completed annotation of the ILMT-
s2s corpus, and outlines preliminary results obtained from the
data.
Name Translation based on Fine-grained Named Entity Recognition in a Single Language
Kugatsu Sadamitsu, Itsumi Saito, Taichi Katayama, Hisako Asano and Yoshihiro Matsuo
We propose named entity abstraction methods with fine-grained
named entity labels for improving statistical machine translation
(SMT). The methods are based on a bilingual named entity
recognizer that uses a monolingual named entity recognizer
with transliteration. Through experiments, we demonstrate that
incorporating fine-grained named entities into statistical machine
translation improves the accuracy of SMT with more adequate
granularity compared with the standard SMT, which is a non-
named entity abstraction method.
Lexical Resources to Enrich English Malayalam Machine Translation
Sreelekha S and Pushpak Bhattacharyya
In this paper we present our work on the use of lexical resources for machine translation between English and Malayalam. We describe a comparative performance evaluation of different Statistical Machine Translation (SMT) systems, with a phrase-based SMT system as the baseline. We explore different ways of utilizing lexical
resources to improve the quality of English Malayalam statistical
machine translation. In order to enrich the training corpus we
have augmented the lexical resources in two ways (a) additional
vocabulary and (b) inflected verbal forms. Lexical resources
include IndoWordnet semantic relation set, lexical words and
verb phrases etc. We have described case studies, evaluations
and have given detailed error analysis for both Malayalam to
English and English to Malayalam machine translation systems.
We observed significant improvement in evaluations of translation
quality. Lexical resources do help improve performance when parallel corpora are scarce.
Novel elicitation and annotation schemes for sentential and sub-sentential alignments of bitexts
Yong Xu and François Yvon
Resources for evaluating sentence-level and word-level alignment
algorithms are unsatisfactory. Regarding sentence alignments, the
existing data is too scarce, especially when it comes to difficult
bitexts, containing instances of non-literal translations. Regarding
word-level alignments, most available hand-aligned data provide a
complete annotation at the level of words that is difficult to exploit,
for lack of a clear semantics for alignment links. In this study,
we propose new methodologies for collecting human judgements
on alignment links, which have been used to annotate 4 new data
sets, at the sentence and at the word level. These will be released
online, with the hope that they will prove useful to evaluate
alignment software and quality estimation tools for automatic
alignment.
Keywords: Parallel corpora, Sentence Alignments, Word Alignments, Confidence Estimation
PROTEST: A Test Suite for Evaluating Pronouns in Machine Translation
Liane Guillou and Christian Hardmeier
We present PROTEST, a test suite for the evaluation of pronoun
translation by MT systems. The test suite comprises 250 hand-
selected pronoun tokens and an automatic evaluation method
which compares the translations of pronouns in MT output with
those in the reference translation. Pronoun translations that do not
match the reference are referred for manual evaluation. PROTEST
is designed to support analysis of system performance at the level
of individual pronoun groups, rather than to provide a single
aggregate measure over all pronouns. We wish to encourage
detailed analyses to highlight issues in the handling of specific
linguistic mechanisms by MT systems, thereby contributing to
a better understanding of those problems involved in translating
pronouns. We present two use cases for PROTEST: a) for
measuring improvement/degradation of an incremental system
change, and b) for comparing the performance of a group of
systems whose design may be largely unrelated. Following the
latter use case, we demonstrate the application of PROTEST to the
evaluation of the systems submitted to the DiscoMT 2015 shared
task on pronoun translation.
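The automatic part of such a test suite can be sketched as follows: pronoun translations matching the reference count as correct, and mismatches are referred for manual evaluation. The data structure and field names here are hypothetical, not PROTEST's actual format.

```python
def protest_check(cases):
    # Pronoun translations matching the reference pass automatically;
    # mismatches are referred for manual evaluation.
    auto_correct, needs_manual = [], []
    for case in cases:
        if case["mt_pronoun"].lower() == case["ref_pronoun"].lower():
            auto_correct.append(case["id"])
        else:
            needs_manual.append(case["id"])
    return auto_correct, needs_manual

cases = [
    {"id": 1, "mt_pronoun": "it", "ref_pronoun": "it"},  # match
    {"id": 2, "mt_pronoun": "he", "ref_pronoun": "it"},  # refer to human
]
print(protest_check(cases))  # -> ([1], [2])
```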
Paraphrasing Out-of-Vocabulary Words with Word Embeddings and Semantic Lexicons for Low Resource Statistical Machine Translation
Chenhui Chu and Sadao Kurohashi
Out-of-vocabulary (OOV) words are a crucial problem in statistical machine translation (SMT) with low resources. OOV
paraphrasing that augments the translation model for the OOV
words by using the translation knowledge of their paraphrases has
been proposed to address the OOV problem. In this paper, we
propose using word embeddings and semantic lexicons for OOV
paraphrasing. Experiments conducted on a low resource setting of
the OLYMPICS task of IWSLT 2012 verify the effectiveness of
our proposed method.
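The core idea, borrowing the translation of the nearest in-vocabulary neighbour in embedding space, can be sketched as below; the embeddings, vocabulary, and translations are invented toy values, not the authors' data:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Invented toy embeddings; "snowboarding" is OOV for the translation model
emb = {"skiing": [0.9, 0.1], "cooking": [0.1, 0.9], "snowboarding": [0.8, 0.2]}
translations = {"skiing": "esquí", "cooking": "cocina"}

def paraphrase_oov(word, emb, translations):
    # Borrow the translation of the most similar in-vocabulary word
    in_vocab = [w for w in emb if w in translations]
    best = max(in_vocab, key=lambda w: cosine(emb[word], emb[w]))
    return translations[best]

print(paraphrase_oov("snowboarding", emb, translations))  # -> esquí
```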
P06 - Parsing
Wednesday, May 25, 14:45
Chairperson: Giuseppe Attardi
Poster Session
The Denoised Web Treebank: Evaluating Dependency Parsing under Noisy Input Conditions
Joachim Daiber and Rob van der Goot
We introduce the Denoised Web Treebank: a treebank including
a normalization layer and a corresponding evaluation metric
for dependency parsing of noisy text, such as Tweets. This
benchmark enables the evaluation of parser robustness as well as
text normalization methods, including normalization as machine
translation and unsupervised lexical normalization, directly on
syntactic trees. Experiments show that text normalization together
with a combination of domain-specific and generic part-of-speech
taggers can lead to a significant improvement in parsing accuracy
on this test set.
Punctuation Prediction for Unsegmented Transcript Based on Word Vector
Xiaoyin Che, Cheng Wang, Haojin Yang and Christoph Meinel
In this paper we propose an approach to predict punctuation marks
for unsegmented speech transcript. The approach is purely lexical,
with pre-trained Word Vectors as the only input. A training model
of Deep Neural Network (DNN) or Convolutional Neural Network
(CNN) is applied to classify whether a punctuation mark should be
inserted after the third word of a 5-word sequence and which kind
of punctuation mark the inserted one should be. TED talks within
IWSLT dataset are used in both training and evaluation phases.
The proposed approach shows its effectiveness by achieving better results than the state-of-the-art lexical solution that works with the same type of data, especially when predicting punctuation position only.
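The windowing scheme described above can be sketched as follows, with a stand-in rule replacing the DNN/CNN classifier; the padding token and the rule itself are assumptions, not the paper's model.

```python
def windows(tokens, size=5, pad="<PAD>"):
    # One window per token: the classifier decides which punctuation
    # mark, if any, follows the middle (third) word of the window.
    half = size // 2
    padded = [pad] * half + tokens + [pad] * half
    return [tuple(padded[i:i + size]) for i in range(len(tokens))]

def toy_classifier(window):
    # Stand-in for the DNN/CNN: an invented rule, not a trained model
    return "," if window[2] == "however" else ""

tokens = "we tried however it failed".split()
print([toy_classifier(w) for w in windows(tokens)])
# -> ['', '', ',', '', '']
```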
Evaluating a Deterministic Shift-Reduce Neural Parser for Constituent Parsing
Hao Zhou, Yue Zhang, Shujian Huang, Xin-Yu Dai and Jiajun Chen
Greedy transition-based parsers are appealing for their very fast
speed, with reasonably high accuracies. In this paper, we build
a fast shift-reduce neural constituent parser by using a neural
network to make local decisions. One challenge to the parsing
speed is the large hidden and output layer sizes caused by the
number of constituent labels and branching options. We speed
up the parser by using a hierarchical output layer, inspired by the
hierarchical log-bilinear neural language model. In standard WSJ
experiments, the neural parser achieves an almost 2.4-times speed-up (320 sentences/sec) compared to a non-hierarchical baseline without
significant accuracy loss (89.06 vs 89.13 F-score).
Language Resource Addition Strategies for Raw Text Parsing
Atsushi Ushiku, Tetsuro Sasada and Shinsuke Mori
We focus on the improvement of accuracy of raw text parsing,
from the viewpoint of language resource addition. In Japanese, the
raw text parsing is divided into three steps: word segmentation,
part-of-speech tagging, and dependency parsing. We investigate
the contribution of language resource addition in each of three
steps to the improvement in accuracy for two domain corpora. The
experimental results show that this improvement depends on the
target domain. For example, when we handle well-written texts with a limited vocabulary, such as white papers, an effective language resource for parsing accuracy is a word-POS pair sequence corpus. So we
conclude that it is important to examine the characteristics of the
target domain and to choose a suitable language resource addition
strategy for the parsing accuracy improvement.
E-TIPSY: Search Query Corpus Annotated with Entities, Term Importance, POS Tags, and Syntactic Parses
Yuval Marton and Kristina Toutanova
We present E-TIPSY, a search query corpus annotated with named
Entities, Term Importance, POS tags, and SYntactic parses. This
corpus contains crowdsourced (gold) annotations of the three
most important terms in each query. In addition, it contains
automatically produced annotations of named entities, part-of-
speech tags, and syntactic parses for the same queries. This
corpus comes in two formats: (1) Sober Subset: annotations that
two or more crowd workers agreed upon, and (2) Full Glass: all
annotations. We analyze the strikingly low correlation between
term importance and syntactic headedness, which invites research
into effective ways of combining these different signals. Our
corpus can serve as a benchmark for term importance methods
aimed at improving search engine quality and as an initial step
toward developing a dataset of gold linguistic analysis of web
search queries. In addition, it can be used as a basis for linguistic
inquiries into the kind of expressions used in search.
AfriBooms: An Online Treebank for Afrikaans
Liesbeth Augustinus, Peter Dirix, Daniel Van Niekerk, Ineke Schuurman, Vincent Vandeghinste, Frank Van Eynde and Gerhard Van Huyssteen
Compared to well-resourced languages such as English and
Dutch, natural language processing (NLP) tools for Afrikaans are
still not abundant. In the context of the AfriBooms project, KU
Leuven and the North-West University collaborated to develop
a first, small treebank, a dependency parser, and an easy-to-use online linguistic search engine for Afrikaans for use by
researchers and students in the humanities and social sciences.
The search tool is based on a similar development for Dutch,
i.e. GrETEL, a user-friendly search engine which allows users to
query a treebank by means of a natural language example instead
of a formal search instruction.
Differentia compositionem facit. A Slower-Paced and Reliable Parser for Latin
Edoardo Maria Ponti and Marco Passarotti
The Index Thomisticus Treebank is the largest available
treebank for Latin; it contains Medieval Latin texts by Thomas
Aquinas. After experimenting on its data with a number
of dependency parsers based on different supervised machine
learning techniques, we found that DeSR with a multilayer
perceptron algorithm, a right-to-left transition, and a tailor-
made feature model is the parser providing the highest accuracy
rates. We improved the results further by using a technique
that combines the output parses of DeSR with those provided
by other parsers, outperforming the previous state of the art in
parsing the Index Thomisticus Treebank. The key idea behind
such improvement is to ensure a sufficient diversity and accuracy
of the outputs to be combined; for this reason, we performed an
in-depth evaluation of the results provided by the different parsers
that we combined. Finally, we assessed that, although the general
architecture of the parser is portable to Classical Latin, the
model trained on Medieval Latin is inadequate for that purpose.
South African Language Resources: Phrase Chunking
Roald Eiselen
Phrase chunking remains an important natural language
processing (NLP) technique for intermediate syntactic processing.
This paper describes the development of protocols, annotated
phrase chunking data sets and automatic phrase chunkers for
ten South African languages. Various problems with adapting
the existing English annotation protocols are discussed, and an
overview of the annotated data sets is given. Based on the
annotated sets, CRF-based phrase chunkers are created and tested
with a combination of different features, including part of speech
tags and character n-grams. The results of the phrase chunking
evaluation show that disjunctively written languages achieve
notably better phrase chunking results with a limited data
set than conjunctive languages, but that the addition of character
n-grams improves the results for conjunctive languages.
Neural Scoring Function for MST Parser
Jindrich Libovický
Continuous word representations have proven to be a useful feature
in many natural language processing tasks. Using fixed-dimension
pre-trained word embeddings allows us to avoid sparse bag-of-
words representations and to train models with fewer parameters.
In this paper, we use fixed pre-trained word embeddings as
additional features for a neural scoring function in the MST parser.
With the multi-layer architecture of the scoring function we can
avoid handcrafting feature conjunctions. The continuous word
representations on the input also allow us to reduce the number of
lexical features, make the parser more robust to out-of-vocabulary
words, and reduce the total number of parameters of the model.
Although its accuracy stays below the state of the art, the model
size is substantially smaller than with the standard feature set.
Moreover, it performs well for languages for which only a small
treebank is available, and the results promise to be useful in
cross-lingual parsing.
Analysing Constraint Grammars with a SAT-solver
Inari Listenmaa and Koen Claessen
We describe a method for analysing Constraint Grammars (CG)
that can detect internal conflicts and redundancies in a given
grammar, without the need for a corpus. The aim is for grammar
writers to be able to automatically diagnose, and then manually
improve their grammars. Our method works by translating the
given grammar into logical constraints that are analysed by a SAT-
solver. We have evaluated our analysis on a number of non-trivial
grammars and found inconsistencies.
Old French Dependency Parsing: Results of Two Parsers Analysed from a Linguistic Point of View
Achim Stein
The treatment of medieval texts is a particular challenge for
parsers. I compare how two dependency parsers, one graph-based,
the other transition-based, perform on Old French, facing some
typical problems of medieval texts: graphical variation, relatively
free word order, and syntactic variation of several parameters
over a diachronic period of about 300 years. Both parsers were
trained and evaluated on the “Syntactic Reference Corpus of
Medieval French” (SRCMF), a manually annotated dependency
treebank. I discuss the relation between types of parsers and types
of language, as well as the differences of the analyses from a
linguistic point of view.
Semi-automatic Parsing for Web Knowledge Extraction through Semantic Annotation
Maria Pia di Buono
Parsing Web information, namely parsing content to find relevant
documents on the basis of a user’s query, represents a crucial
step to guarantee fast and accurate Information Retrieval (IR).
Generally, an automated approach to such a task is considered faster
and cheaper than manual systems. Nevertheless, the results do not
seem to have a high level of accuracy; indeed, as Hjorland
(2007) states, using stochastic algorithms entails: low precision,
due to the indexing of common Atomic Linguistic Units (ALUs)
or sentences; low recall, caused by the presence of synonyms;
and generic results, arising from the use of too broad or too narrow
terms. Usually, IR systems are based on an inverted text index, namely
an index data structure storing a mapping from content to its
locations in a database file, or in a document or a set of documents.
In this paper we propose a system by means of which we will
develop a search engine able to process online documents, starting
from a natural language query, and return information to users.
The proposed approach, based on the Lexicon-Grammar (LG)
framework and its language formalization methodologies, aims at
integrating a semantic annotation process for both query analysis
and document retrieval.
P07 - Speech Corpora and Databases (1)
Wednesday, May 25, 14:45
Chairperson: Carmen García Mateo
Poster Session
Towards Automatic Transcription of ILSE – an Interdisciplinary Longitudinal Study of Adult Development and Aging
Jochen Weiner, Claudia Frankenberg, Dominic Telaar, Britta Wendelstein, Johannes Schröder and Tanja Schultz
The Interdisciplinary Longitudinal Study on Adult Development
and Aging (ILSE) was created to facilitate the study of challenges
posed by rapidly aging societies in developed countries such
as Germany. ILSE contains over 8,000 hours of biographic
interviews recorded from more than 1,000 participants over the
course of 20 years. Investigations on various aspects of aging,
such as cognitive decline, often rely on the analysis of linguistic
features which can be derived from spoken content like these
interviews. However, transcribing speech is a time- and cost-
consuming manual process, and so far only 380 hours of ILSE
interviews have been transcribed. Thus, it is the aim of our work
to establish technical systems to fully automatically transcribe
the ILSE interview data. The joint occurrence of poor recording
quality, long audio segments, erroneous transcriptions, varying
speaking styles & crosstalk, and emotional & dialectal speech
in these interviews presents challenges for automatic speech
recognition (ASR). We describe our ongoing work towards the
fully automatic transcription of all ILSE interviews and the
steps we implemented in preparing the transcriptions to meet the
interviews’ challenges. Using a recursive long audio alignment
procedure, 96 hours of the transcribed data have been made
accessible for ASR training.
FABIOLE, a Speech Database for Forensic Speaker Comparison
Moez Ajili, Jean-François Bonastre, Juliette Kahn, Solange Rossato and Guillaume Bernard
A speech database has been collected to highlight the
importance of the “speaker factor” in forensic voice comparison.
FABIOLE was created during the FABIOLE project funded
by the French Research Agency (ANR) from 2013 to 2016. The
corpus consists of more than 3,000 excerpts spoken by 130
French native male speakers. The speakers are divided into two
categories: 30 target speakers, each with 100 excerpts,
and 100 “impostors”, each with only one excerpt. The
data were collected from 10 different French radio and television
shows, where each utterance turn has a minimum duration of
30 s and good speech quality. The data set is mainly used
for investigating the speaker factor in forensic voice comparison and
for interpreting some unsolved issues such as the relationship between
speaker characteristics and system behavior. In this paper, we
present the FABIOLE database. Then, preliminary experiments are
performed to evaluate the effect of the “speaker factor” and the
show on the behavior of a voice comparison system.
Phonetic Inventory for an Arabic Speech Corpus
Nawar Halabi and Mike Wald
Corpus design for speech synthesis is a well-researched topic in
languages such as English compared to Modern Standard Arabic,
and there is a tendency to focus on methods to automatically
generate the orthographic transcript to be recorded (usually greedy
methods). In this work, a study of Modern Standard Arabic
(MSA) phonetics and phonology is conducted in order to create
criteria for a greedy method to create a speech corpus transcript
for recording. The size of the dataset is reduced a number of
times using these optimisation methods with different parameters,
yielding a much smaller dataset with phonetic coverage identical
to that before the reduction; this output transcript is chosen for
recording. This is part of a larger work to create a completely
annotated and segmented speech corpus for MSA.
AIMU: Actionable Items for Meeting Understanding
Yun-Nung Chen and Dilek Hakkani-Tur
With emerging conversational data, automated content analysis
is needed for better data interpretation, so that it is accurately
understood and can be effectively integrated and utilized in
various applications. The ICSI meeting corpus is a publicly released
data set of multi-party meetings in an organization, released
over a decade ago, and it has been fostering meeting
understanding research since then. The original data collection
includes transcription of participant turns as well as meta-data
annotations, such as disfluencies and dialog act tags. This paper
presents an extended set of annotations for the ICSI meeting
corpus with a goal of deeply understanding meeting conversations,
where participant turns are annotated by actionable items that
could be performed by an automated meeting assistant. In addition
to the user utterances that contain an actionable item, annotations
also include the arguments associated with the actionable item.
The set of actionable items is determined by aligning human-
human interactions to human-machine interactions, where a
data annotation schema designed for a virtual personal assistant
(human-machine genre) is adapted to the meetings domain
(human-human genre). The data set is formed by annotating
participants’ utterances in meetings with potential intents/actions
considering their contexts. The set of actions target what could be
accomplished by an automated meeting assistant, such as taking
a note of action items that a participant commits to, or finding
emails or topic related documents that were mentioned during the
meeting. A total of 10 defined intents/actions are considered as
actionable items in meetings. Turns that include actionable intents
were annotated for 22 public ICSI meetings, which include a total of
21K utterances, segmented by speaker turns. Participants’ spoken
turns, possible actions along with associated arguments and
their vector representations as computed by convolutional deep
structured semantic models are included in the data set for future
research. We present a detailed statistical analysis of the data
set and analyze the performance of applying convolutional deep
structured semantic models for an actionable item detection task.
The data is available at http://research.microsoft.com/projects/meetingunderstanding/.
A Taxonomy of Specific Problem Classes in Text-to-Speech Synthesis: Comparing Commercial and Open Source Performance
Felix Burkhardt and Uwe D. Reichel
Current state-of-the-art speech synthesizers for domain-
independent systems still struggle with the challenge of generating
understandable and natural-sounding speech. This is mainly
because the pronunciation of words of foreign origin, inflections
and compound words often cannot be handled by rules.
Furthermore, there are too many of these for inclusion in exception
dictionaries. We describe an approach to evaluating text-to-speech
synthesizers with a subjective listening experiment. The focus
is to differentiate between known problem classes for speech
synthesizers. The target language is German but we believe that
many of the described phenomena are not language specific. We
distinguish the following problem categories: Normalization,
Foreign linguistics, Natural writing, Language specific and
General. Each of them is divided into three to five problem
classes. Word lists for each of the above-mentioned categories
were compiled and synthesized by both a commercial and an
open source synthesizer, both being based on the non-uniform
unit-selection approach. The synthesized speech was evaluated by
human judges using the Speechalyzer toolkit and the results are
discussed. The results show that, as expected, the commercial
synthesizer performs much better than the open-source one, and that
words of foreign origin in particular were pronounced badly by both systems.
A Comparative Analysis of Crowdsourced Natural Language Corpora for Spoken Dialog Systems
Patricia Braunger, Hansjörg Hofmann, Steffen Werner and Maria Schmidt
Recent spoken dialog systems have been able to recognize freely
spoken user input in restricted domains thanks to statistical
methods in the automatic speech recognition. These methods
require a high number of natural language utterances to train
the speech recognition engine and to assess the quality of the
system. Since human speech offers many variants associated
with a single intent, a high number of user utterances has to
be elicited. Developers are therefore turning to crowdsourcing to
collect this data. This paper compares three different methods to
elicit multiple utterances for given semantics via crowdsourcing,
namely with pictures, with text and with semantic entities.
Specifically, we compare the methods with regard to the amount
of valid data and the linguistic variance, for which a quantitative
and qualitative approach is proposed. In our study, the method with
text led to a high variance in the utterances and a relatively low
rate of invalid data.
A Singing Voice Database in Basque for Statistical Singing Synthesis of Bertsolaritza
Xabier Sarasola, Eva Navas, David Tavarez, Daniel Erro, Ibon Saratxaga and Inma Hernaez
This paper describes the characteristics and structure of a Basque
singing voice database of bertsolaritza. Bertsolaritza is a popular
singing style from the Basque Country, sung exclusively in Basque,
that is improvised and a cappella. The database is designed to
be used in statistical singing voice synthesis for bertsolaritza
style. Starting from the recordings and transcriptions of numerous
singers, diarization and phoneme alignment experiments have
been carried out to extract the singing voice from the recordings and
create phoneme alignments. These labelling processes have been
performed applying standard speech processing techniques, and
the results prove that these techniques can be used in this specific
singing style.
AMISCO: The Austrian German Multi-Sensor Corpus
Hannes Pessentheiner, Thomas Pichler and Martin Hagmüller
We introduce a unique, comprehensive Austrian German multi-
sensor corpus with moving and non-moving speakers to facilitate
the evaluation of estimators and detectors that jointly detect
a speaker’s spatial and temporal parameters. The corpus is
suitable for various machine learning and signal processing
tasks, linguistic studies, and studies related to a speaker’s
fundamental frequency (due to recorded glottograms). Available
corpora are limited to (synthetically generated/spatialized) speech
data or recordings of musical instruments that lack moving
speakers, glottograms, and/or multi-channel distant speech
recordings. That is why we recorded 24 spatially non-moving
and moving speakers, balanced male and female, to set up a
two-room and 43-channel Austrian German multi-sensor speech
corpus. It contains 8.2 hours of read speech based on
phonetically balanced sentences, commands, and digits. The
orthographic transcriptions include around 53,000 word tokens
and 2,070 word types. Special features of this corpus are
the laryngograph recordings (representing glottograms required
to detect a speaker’s instantaneous fundamental frequency
and pitch), corresponding clean-speech recordings, and spatial
information and video data provided by four Kinects and a
camera.
A Database of Laryngeal High-Speed Videos with Simultaneous High-Quality Audio Recordings of Pathological and Non-Pathological Voices
Philipp Aichinger, Immer Roesner, Matthias Leonhard, Doris-Maria Denk-Linnert, Wolfgang Bigenzahn and Berit Schneider-Stickler
Auditory voice quality judgements are used intensively for
the clinical assessment of pathological voice. However, voice quality
concepts are fuzzily defined and poorly standardized,
which hinders scientific and clinical communication. The
described database documents a wide variety of pathologies
and is used to investigate auditory voice quality concepts with
regard to phonation mechanisms. The database contains 375
laryngeal high-speed videos and simultaneous high-quality audio
recordings of sustained phonations of 80 pathological and 40
non-pathological subjects. Interval-wise annotations regarding
video and audio quality, as well as voice quality ratings, are
provided. Video quality is annotated for the visibility of
anatomical structures and artefacts such as blurring or reduced
contrast. Voice quality annotations include ratings on the presence
of dysphonia and diplophonia. The purpose of the database is
to aid the formulation of observationally well-founded models
of phonation and the development of model-based automatic
detectors for distinct types of phonation, especially for clinically
relevant nonmodal voice phenomena. Another application is
the training of audio-based fundamental frequency extractors on
video-based reference fundamental frequencies.
BulPhonC: Bulgarian Speech Corpus for the Development of ASR Technology
Neli Hateva, Petar Mitankin and Stoyan Mihov
In this paper we introduce a Bulgarian speech database, which
was created for the purpose of ASR technology development. The
paper describes the design and the content of the speech database.
We also present an empirical evaluation of the performance of an
LVCSR system for Bulgarian trained on the BulPhonC data. The
resource is freely available for scientific use.
Designing a Speech Corpus for the Development and Evaluation of Dictation Systems in Latvian
Marcis Pinnis, Askars Salimbajevs and Ilze Auzina
In this paper the authors present a speech corpus designed
and created for the development and evaluation of dictation
systems in Latvian. The corpus consists of over nine hours of
orthographically annotated speech from 30 different speakers.
The corpus features spoken commands that are common for
dictation systems for text editors. The corpus is evaluated in an
automatic speech recognition scenario. Evaluation results in an
ASR dictation scenario show that adding the corpus to the
acoustic model training data, in combination with language model
adaptation, decreases the WER by up to 41.36% relative
(16.83% absolute) compared to a baseline system
without language model adaptation. The contribution of acoustic
data augmentation is 12.57% relative (3.43% absolute).
The LetsRead Corpus of Portuguese Children Reading Aloud for Performance Evaluation
Jorge Proença, Dirce Celorico, Sara Candeias, Carla Lopes and Fernando Perdigão
This paper introduces the LetsRead Corpus of European
Portuguese read speech from 6 to 10 years old children.
The motivation for the creation of this corpus stems from
the lack of databases with recordings of reading tasks
by Portuguese children of different performance levels
that include all the common reading-aloud disfluencies. It is
also essential to develop techniques to fulfill the main objective
of the LetsRead project: to automatically evaluate the reading
performance of children through the analysis of reading tasks.
The collected data amounts to 20 hours of speech from 284
children from private and public Portuguese schools, with each
child carrying out two tasks: reading sentences and reading a list
of pseudowords, both with varying levels of difficulty throughout
the school grades. In this paper, the design of the reading
tasks presented to children is described, as well as the collection
procedure. The manually annotated data are analyzed in terms of
disfluencies and reading performance. The considered word
difficulty parameter is also confirmed to be suitable for the
pseudoword reading tasks.
The BAS Speech Data Repository
Uwe Reichel, Florian Schiel, Thomas Kisler, Christoph Draxler and Nina Pörner
The BAS CLARIN speech data repository is introduced. In its
current state it comprises 31 predominantly German corpora of
spoken language. It is compliant with both the CLARIN-D and
the OLAC requirements, which enables its embedding into
several infrastructures. We give an overview of its structure,
its implementation, and the corpora it contains.
A Dutch Dysarthric Speech Database for Individualized Speech Therapy Research
Emre Yilmaz, Mario Ganzeboom, Lilian Beijer, Catia Cucchiarini and Helmer Strik
We present a new Dutch dysarthric speech database containing
utterances of neurological patients with Parkinson’s disease,
traumatic brain injury and cerebrovascular accident. The speech
content is phonetically and linguistically diversified by using
numerous structured sentence and word lists. Containing
more than 6 hours of mildly to moderately dysarthric speech,
this database can be used for research on dysarthria and
for developing and testing speech-to-text systems designed for
medical applications. Current activities aimed at extending this
database are also discussed.
P08 - Summarisation
Wednesday, May 25, 14:45
Chairperson: Gerard de Melo
Poster Session
Urdu Summary Corpus
Muhammad Humayoun, Rao Muhammad Adeel Nawab, Muhammad Uzair, Saba Aslam and Omer Farzand
Language resources, such as corpora, are important for various
natural language processing tasks. Urdu has millions of speakers
around the world but it is under-resourced in terms of standard
evaluation resources. This paper reports the construction of a
benchmark corpus for Urdu summaries (abstracts) to facilitate the
development and evaluation of single-document summarization
systems for the Urdu language. In Urdu, space does not always
mark a word boundary. Therefore, we created two versions of the
same corpus. In the first version, words are separated by space.
In contrast, proper word boundaries are manually tagged in the
second version. We further apply normalization, part-of-speech
tagging, morphological analysis, lemmatization, and stemming
for the articles and their summaries in both versions. In order to
apply these annotations, we re-implemented some NLP tools for
Urdu. We provide the Urdu Summary Corpus, all these annotations,
and the needed software tools (as open source) for researchers
to run experiments and evaluate their work, including but not
limited to the single-document summarization task.
A Publicly Available Indonesian Corpora for Automatic Abstractive and Extractive Chat Summarization
Fajri Koto
In this paper we report our effort to construct the first ever
Indonesian corpus for chat summarization. Specifically, we
utilized documents of multi-participant chats from a well-known
online instant messaging application, WhatsApp. We construct
the gold standard by asking three native speakers to manually
summarize 300 chat sections (152 of them containing images).
As a result, three reference summaries, in both extractive and
abstractive form, are produced for each chat section. The
corpus is still in its early stage of investigation, yielding exciting
possibilities for future work.
Revisiting Summarization Evaluation for Scientific Articles
Arman Cohan and Nazli Goharian
Evaluation of text summarization approaches has mostly been
based on metrics that measure the similarity of system-generated
summaries to a set of human-written gold-standard summaries.
The most widely used metric in summarization evaluation has
been the ROUGE family. ROUGE relies solely on lexical overlap
between the terms and phrases in the sentences; therefore, in
cases of terminology variation and paraphrasing, ROUGE is
not as effective. Scientific article summarization is one such
case that is different from general-domain summarization (e.g.
newswire data). We provide an extensive analysis of ROUGE’s
effectiveness as an evaluation metric for scientific summarization;
we show that, contrary to common belief, ROUGE is not
very reliable in evaluating scientific summaries. We furthermore
show how different variants of ROUGE result in very different
correlations with the manual Pyramid scores. Finally, we propose
an alternative metric for summarization evaluation which is based
on the content relevance between a system-generated summary
and the corresponding human-written summaries. We call our
metric SERA (Summarization Evaluation by Relevance Analysis).
Unlike ROUGE, SERA consistently achieves high correlations
with manual scores, which shows its effectiveness in evaluating
scientific article summarization.
The OnForumS corpus from the Shared Task on Online Forum Summarisation at MultiLing 2015
Mijail Kabadjov, Udo Kruschwitz, Massimo Poesio, Josef Steinberger, Jorge Valderrama and Hugo Zaragoza
In this paper we present the OnForumS corpus developed for the
shared task of the same name on Online Forum Summarisation
(OnForumS at MultiLing’15). The corpus consists of a set
of news articles with associated readers’ comments from The
Guardian (English) and La Repubblica (Italian). It comes
with four levels of annotation: argument structure, comment-
article linking, sentiment and coreference. The former three
were produced through crowdsourcing, whereas the latter was
produced by an experienced annotator using a mature annotation scheme. Given
its annotation breadth, we believe the corpus will prove a useful
resource in stimulating and furthering research in the areas of
Argumentation Mining, Summarisation, Sentiment, Coreference
and the interlinks therein.
P09 - Word Sense Disambiguation (1)
Wednesday, May 25, 14:45
Chairperson: Luca Dini
Poster Session
Automatic Enrichment of WordNet with Common-Sense Knowledge
Luigi Di Caro and Guido Boella
WordNet represents a cornerstone in the Computational
Linguistics field, linking words to meanings (or senses) through a
taxonomical representation of synsets, i.e., clusters of words with
an equivalent meaning in a specific context, often described by
a few definitions (or glosses) and examples. Most of the approaches
to the Word Sense Disambiguation task fully rely on these short
texts as a source of contextual information to match with the
input text to disambiguate. This paper presents the first attempt to
enrich synset data with common-sense definitions, automatically
retrieved from ConceptNet 5 and disambiguated according to
WordNet. The aim was to exploit the shared- and immediate-
thinking nature of common-sense knowledge to extend the short
but incredibly useful contextual information of the synsets. A
manual evaluation on a subset of the entire result (which counts
a total of almost 600K synset enrichments) shows very high
precision with an estimated good recall.
VPS-GradeUp: Graded Decisions on Usage Patterns
Vít Baisa, Silvie Cinkova, Ema Krejcová and Anna Vernerová
We present VPS-GradeUp – a set of 11,400 graded human
decisions on usage patterns of 29 English lexical verbs from
the Pattern Dictionary of English Verbs by Patrick Hanks.
The annotation contains, for each verb lemma, a batch of 50
concordances with the given lemma as KWIC, and for each
of these concordances we provide a graded human decision
on how well the individual PDEV patterns for this particular
lemma illustrate the given concordance, indicated on a 7-point
Likert scale for each PDEV pattern. With our annotation, we
were pursuing a pilot investigation of the foundations of human
clustering and disambiguation decisions with respect to usage
patterns of verbs in context. The data set is publicly available
at http://hdl.handle.net/11234/1-1585.
Sense-annotating a Lexical Substitution Data Set with Ubyline
Tristan Miller, Mohamed Khemakhem, Richard Eckart de Castilho and Iryna Gurevych
We describe the construction of GLASS, a newly sense-
annotated version of the German lexical substitution data set
used at the GermEval 2015: LexSub shared task. Using the
two annotation layers, we conduct the first known empirical
study of the relationship between manually applied word
senses and lexical substitutions. We find that synonymy and
hypernymy/hyponymy are the only semantic relations directly
linking targets to their substitutes, and that substitutes in the
target’s hypernymy/hyponymy taxonomy closely align with the
synonyms of a single GermaNet synset. Despite this, these
substitutes account for a minority of those provided by the
annotators. The results of our analysis accord with those
of a previous study on English-language data (albeit with
automatically induced word senses), leading us to suspect that the
sense–substitution relations we discovered may be of a universal
nature. We also tentatively conclude that relatively cheap lexical
substitution annotations can be used as a knowledge source for
automatic WSD. Also introduced in this paper is Ubyline, the
web application used to produce the sense annotations. Ubyline
presents an intuitive user interface optimized for annotating lexical
sample data, and is readily adaptable to sense inventories other
than GermaNet.
A Corpus of Literal and Idiomatic Uses of German Infinitive-Verb Compounds
Andrea Horbach, Andrea Hensler, Sabine Krome, Jakob Prange, Werner Scholze-Stubenrecht, Diana Steffen, Stefan Thater, Christian Wellner and Manfred Pinkal
We present an annotation study on a representative dataset of
literal and idiomatic uses of German infinitive-verb compounds
in newspaper and journal texts. Infinitive-verb compounds form
a challenge for writers of German, because spelling regulations
are different for literal and idiomatic uses. Through the
participation of expert lexicographers we were able to obtain a
high-quality corpus resource which offers itself as a testbed for
automatic idiomaticity detection and coarse-grained word-sense
disambiguation. We trained a classifier on the corpus which was
able to distinguish literal and idiomatic uses with an accuracy of
85%.
The SemDaX Corpus – Sense Annotations with Scalable Sense Inventories
Bolette Pedersen, Anna Braasch, Anders Johannsen, Héctor Martínez Alonso, Sanni Nimb, Sussi Olsen, Anders Søgaard and Nicolai Hartvig Sørensen
We present the SemDaX corpus, a recently completed
Danish human-annotated corpus available through a CLARIN
academic license. The corpus includes approx. 90,000 words,
comprises six textual domains, and is annotated with sense
inventories of different granularity. The aim of the developed
corpus is twofold: i) to assess the reliability of the different sense
annotation schemes for Danish measured by qualitative analyses
and annotation agreement scores, and ii) to serve as training
and test data for machine learning algorithms with the practical
purpose of developing sense taggers for Danish. To these aims,
we take a new approach to human-annotated corpus resources by
double annotating a much larger part of the corpus than is
normally seen: for the all-words task we double annotated 60%
of the material, and for the lexical sample task 100%. We include
in the corpus not only the adjudicated files, but also the diverging
annotations. In other words, we consider not all disagreement
to be noise, but rather to contain valuable linguistic information
that can help us improve our annotation schemes and our learning
algorithms.
Graded and Word-Sense-Disambiguation Decisions in Corpus Pattern Analysis: a Pilot Study
Silvie Cinkova, Ema Krejcová, Anna Vernerová and Vít Baisa
We present a pilot analysis of a new linguistic resource, VPS-
GradeUp (available at http://hdl.handle.net/11234/
1-1585). The resource contains 11,400 graded human decisions
on usage patterns of 29 English lexical verbs, randomly selected
from the Pattern Dictionary of English Verbs (Hanks, 2000–2014)
based on their frequency and the number of senses their
lemmas have in PDEV. This data set has been created to observe
the interannotator agreement on PDEV patterns produced using
the Corpus Pattern Analysis (Hanks, 2013). Apart from the
graded decisions, the data set also contains traditional Word-
Sense-Disambiguation (WSD) labels. We analyze the associations
between the graded annotation and WSD annotation. The results
of the respective annotations do not correlate with the size of the
usage pattern inventory for the respective verb lemmas, which
makes the data set worth further linguistic analysis.
Multi-prototype Chinese Character Embedding
Yanan Lu, Yue Zhang and Donghong Ji
Chinese sentences are written as sequences of characters, which
are elementary units of syntax and semantics. Characters are
highly polysemous in forming words. We present a position-
sensitive skip-gram model to learn multi-prototype Chinese
character embeddings, and explore the usefulness of such
character embeddings to Chinese NLP tasks. Evaluation on
character similarity shows that multi-prototype embeddings are
significantly better than a single-prototype baseline. In addition,
used as features in the Chinese NER task, the embeddings result
in a 1.74% F-score improvement over a state-of-the-art baseline.
A Comparison of Named-Entity Disambiguation and Word Sense Disambiguation
Angel Chang, Valentin I. Spitkovsky, Christopher D. Manning and Eneko Agirre
Named Entity Disambiguation (NED) is the task of linking
a named-entity mention to an instance in a knowledge-base,
typically Wikipedia-derived resources like DBpedia. This task
is closely related to word-sense disambiguation (WSD), where
the mention of an open-class word is linked to a concept in a
knowledge-base, typically WordNet. This paper analyzes the
relation between two annotated datasets on NED and WSD,
highlighting the commonalities and differences. We detail the
methods to construct a NED system following the WSD word-
expert approach, where we need a dictionary and one classifier
is built for each target entity mention string. Constructing a
dictionary for NED proved challenging, and although similarity
and ambiguity are higher for NED, the results are also higher
due to the larger amount of training data, and the more crisp and
skewed meaning differences.
O9 - Linked Data
Wednesday, May 25, 16:45
Chairperson: John McCrae Oral Session
Leveraging RDF Graphs for Crossing Multiple Bilingual Dictionaries
Marta Villegas, Maite Melero, Núria Bel and Jorge Gracia
The experiments presented here exploit the properties of
the Apertium RDF Graph, principally cycle density and
nodes’ degree, to automatically generate new translation
relations between words, and therefore to enrich existing
bilingual dictionaries with new entries. Currently, the
Apertium RDF Graph includes data from 22 Apertium bilingual
dictionaries and constitutes a large unified array of linked
lexical entries and translations that are available and accessible
on the Web (http://linguistic.linkeddata.es/
apertium/). In particular, its graph structure allows
for interesting exploitation opportunities, some of which are
addressed in this paper. Two ’massive’ experiments are reported:
in the first one, the original EN-ES translation set was removed
from the Apertium RDF Graph and a new EN-ES version was
generated. The results were compared against the previously
removed EN-ES data and against the Concise Oxford Spanish
Dictionary. In the second experiment, a new non-existent EN-
FR translation set was generated. In this case the results were
compared against a converted wiktionary English-French file. The
results obtained are very good and hold even in the extreme
case of correlated polysemy. This led us to consider the possibility
of using cycle density and node degree to identify potential oddities
in the source data. If cycle density proves efficient when considering
potential targets, we can assume that in dense graphs nodes with
low degree may indicate potential errors.
PreMOn: a Lemon Extension for Exposing Predicate Models as Linked Data
Francesco Corcoglioniti, Marco Rospocher, Alessio Palmero Aprosio and Sara Tonelli
We introduce PreMOn (predicate model for ontologies), a
linguistic resource for exposing predicate models (PropBank,
NomBank, VerbNet, and FrameNet) and mappings between
them (e.g., SemLink) as Linked Open Data. It consists of two
components: (i) the PreMOn Ontology, an extension of the
lemon model by the W3C Ontology-Lexica Community Group,
which makes it possible to homogeneously represent data from the various
predicate models; and, (ii) the PreMOn Dataset, a collection of
RDF datasets integrating various versions of the aforementioned
predicate models and mapping resources. PreMOn is freely
available and accessible online in different ways, including
through a dedicated SPARQL endpoint.
Semantic Links for Portuguese
Fabricio Chalub, Livy Real, Alexandre Rademaker andValeria de Paiva
This paper describes work on incorporating Princeton's
WordNet morphosemantic links into the fabric of the Portuguese
OpenWordNet-PT. Morphosemantic links are relations between
verbs and derivationally related nouns that are semantically typed
(such as for tune-tuner – in Portuguese “afinar-afinador” – linked
through an “agent” link). Morphosemantic links have been
discussed for Princeton’s WordNet for a while, but have not been
added to the official database. These links are very useful, as they
help us to improve our Portuguese WordNet. We thus discuss the
integration of these links in our base and the issues we encountered
with the integration.
Creating Linked Data Morphological Language Resources with MMoOn - The Hebrew Morpheme Inventory
Bettina Klimek, Natanael Arndt, Sebastian Krause and Timotheus Arndt
The development of standard models for describing general lexical
resources has led to the emergence of numerous lexical datasets
of various languages in the Semantic Web. However, equivalent
models covering the linguistic domain of morphology do not
exist. As a result, there are hardly any language resources of
morphemic data available in RDF to date. This paper presents
the creation of the Hebrew Morpheme Inventory from a manually
compiled tabular dataset comprising around 52,000 entries. It is
an ongoing effort of representing the lexemes, word-forms and
morphological patterns together with their underlying relations
based on the newly created Multilingual Morpheme Ontology
(MMoOn). It will be shown how segmented Hebrew language
data can be granularly described in a Linked Data format, thus,
serving as an exemplary case for creating morpheme inventories
of any inflectional language with MMoOn. The resulting dataset
is described a) according to the structure of the underlying data
format, b) with respect to the Hebrew language characteristic
of building word-forms directly from roots, c) by exemplifying
how inflectional information is realized and d) with regard to its
enrichment with external links to sense resources.
O10 - Multilingual Corpora
Wednesday, May 25, 16:45
Chairperson: Hitoshi Isahara Oral Session
Parallel Global Voices: a Collection of Multilingual Corpora with Citizen Media Stories
Prokopis Prokopidis, Vassilis Papavassiliou and Stelios Piperidis
We present a new collection of multilingual corpora automatically
created from the content available in the Global Voices websites,
where volunteers have been posting and translating citizen media
stories since 2004. We describe how we crawled and processed
this content to generate parallel resources comprising 302.6K
document pairs and 8.36M segment alignments in 756 language
pairs. For some language pairs, the segment alignments in this
resource are the first open examples of their kind. In an initial use
of this resource, we discuss how a set of document pair detection
algorithms performs on the Greek-English corpus.
Large Multi-lingual, Multi-level and Multi-genre Annotation Corpus
Xuansong Li, Martha Palmer, Nianwen Xue, Lance Ramshaw, Mohamed Maamouri, Ann Bies, Kathryn Conger, Stephen Grimes and Stephanie Strassel
High accuracy for automated translation and information retrieval
calls for linguistic annotations at various language levels.
The plethora of informal internet content sparked the demand
for porting state-of-art natural language processing (NLP)
applications to new social media as well as diverse language
adaptation. An effort launched by the BOLT (Broad Operational
Language Translation) program at DARPA (Defense Advanced
Research Projects Agency) successfully addressed this internet
content with enhanced NLP systems. BOLT aims for
automated translation and linguistic analysis for informal genres
of text and speech in online and in-person communication.
As a part of this program, the Linguistic Data Consortium
(LDC) developed valuable linguistic resources in support of
the training and evaluation of such new technologies. This
paper focuses on methodologies, infrastructure, and procedure
for developing linguistic annotation at various language levels,
including Treebank (TB), word alignment (WA), PropBank (PB),
and co-reference (CoRef). Inspired by the OntoNotes approach
with adaptations to the tasks to reflect the goals and scope of the
BOLT project, this effort has introduced more annotation types of
informal and free-style genres in English, Chinese and Egyptian
Arabic. The corpus produced is by far the largest multi-lingual,
multi-level and multi-genre annotation corpus of informal text and
speech.
C4Corpus: Multilingual Web-size Corpus with Free License
Ivan Habernal, Omnia Zayed and Iryna Gurevych
Large Web corpora containing full documents with permissive
licenses are crucial for many NLP tasks. In this article we present
the construction of a 12-million-page Web corpus (over 10 billion
tokens) licensed under the Creative Commons license family in 50+
languages that has been extracted from CommonCrawl, the largest
publicly available general Web crawl to date with about 2 billion
crawled URLs. Our highly-scalable Hadoop-based framework is
able to process the full CommonCrawl corpus on a 2000+ CPU
cluster on the Amazon Elastic Map/Reduce infrastructure. The
processing pipeline includes license identification, state-of-the-art
boilerplate removal, exact duplicate and near-duplicate document
removal, and language detection. The construction of the corpus
is highly configurable and fully reproducible, and we provide both
the framework (DKPro C4CorpusTools) and the resulting data
(C4Corpus) to the research community.
OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles
Pierre Lison and Jörg Tiedemann
We present a new major release of the OpenSubtitles collection of
parallel corpora. The release is compiled from a large database
of movie and TV subtitles and includes a total of 1689 bitexts
spanning 2.6 billion sentences across 60 languages. The release
also incorporates a number of enhancements in the preprocessing
and alignment of the subtitles, such as the automatic correction
of OCR errors and the use of meta-data to estimate the quality of
each subtitle and score subtitle pairs.
O11 - Lexicons
Wednesday, May 25, 16:45
Chairperson: Bolette Pedersen Oral Session
LexFr: Adapting the LexIt Framework to Build a Corpus-based French Subcategorization Lexicon
Giulia Rambelli, Gianluca Lebani, Laurent Prévot and Alessandro Lenci
This paper introduces LexFr, a corpus-based French lexical
resource built by adapting the framework LexIt, originally
developed to describe the combinatorial potential of Italian
predicates. As in the original framework, the behavior of a
group of target predicates is characterized by a series of syntactic
(i.e., subcategorization frames) and semantic (i.e., selectional
preferences) statistical information (a.k.a. distributional profiles)
whose extraction process is mostly unsupervised. The first
release of LexFr includes information for 2,493 verbs, 7,939
nouns and 2,628 adjectives. In these pages we describe the
adaptation process and evaluate the final resource by comparing
the information collected for 20 test verbs against the information
available in a gold standard dictionary. In the best performing
setting, we obtained 0.74 precision, 0.66 recall and 0.70
F-measure.
Polarity Lexicon Building: to what Extent Is the Manual Effort Worth?
Iñaki San Vicente and Xabier Saralegi
Polarity lexicons are a basic resource for analyzing the sentiments
and opinions expressed in texts in an automated way. This
paper explores three methods to construct polarity lexicons:
translating existing lexicons from other languages, extracting
polarity lexicons from corpora, and annotating sentiments in Lexical
Knowledge Bases. Each of these methods requires a different
degree of human effort. We evaluate how much manual effort is
needed and to what extent that effort pays off in terms of performance
improvement. Experiment setup includes generating lexicons for
Basque, and evaluating them against gold standard datasets in
different domains. Results show that extracting polarity lexicons
from corpora is the best solution for achieving a good performance
with reasonable human effort.
Al Qamus al Muhit, a Medieval Arabic Lexicon in LMF
Ouafae Nahli, Francesca Frontini, Monica Monachini, Fahad Khan, Arsalan Zarghili and Mustapha Khalfi
This paper describes the conversion into LMF, a standard
digital lexicographic format, of 'al-Qamus al-Muhit', a Medieval
Arabic lexicon. The lexicon is first described, then all the steps
required for the conversion are illustrated. The work will produce
a useful lexicographic resource for Arabic NLP, but is
also interesting per se, to study the implications of adapting the
LMF model to the Arabic language. Some reflections are offered
as to the status of roots with respect to previously suggested
representations. In particular, roots are, in our opinion, not to be
treated as lexical entries, but modeled as lexical metadata for
classifying and identifying lexical entries. In this manner, each
root connects all entries that are derived from it.
CASSAurus: A Resource of Simpler Spanish Synonyms
Ricardo Baeza-Yates, Luz Rello and Julia Dembowski
In this work we introduce and describe a language resource
composed of lists of simpler synonyms for Spanish. The
synonyms are divided into different senses taken from the Spanish
OpenThesaurus, where context disambiguation was performed by
using statistical information from the Web and Google Books
Ngrams. This resource is freely available online and can be used
for different NLP tasks such as lexical simplification. Indeed, it
has already been integrated into four tools.
O12 - OCR for Historical Text
Wednesday, May 25, 16:45
Chairperson: Thierry Declerck Oral Session
Measuring Lexical Quality of a Historical Finnish Newspaper Collection – Analysis of Garbled OCR Data with Basic Language Technology Tools and Means
Kimmo Kettunen and Tuula Pääkkönen
The National Library of Finland has digitized a large proportion
of the historical newspapers published in Finland between 1771
and 1910 (Bremer-Laamanen 2001). This collection contains
approximately 1.95 million pages in Finnish and Swedish. The
Finnish part of the collection consists of about 2.39 billion
words. The National Library’s Digital Collections are offered
via the digi.kansalliskirjasto.fi web service, also known as Digi.
Part of this material is also available freely downloadable in
The Language Bank of Finland provided by the Fin-CLARIN
consortium. The collection can also be accessed through the
Korp environment that has been developed by Språkbanken at the
University of Gothenburg and extended by FIN-CLARIN team
at the University of Helsinki to provide concordances of text
resources. A Cranfield-style information retrieval test collection
has been produced out of a small part of the Digi newspaper
material at the University of Tampere (Järvelin et al., 2015). The
quality of the OCRed collections is an important topic in digital
humanities, as it affects general usability and searchability of
collections. There is no single available method to assess the
quality of large collections, but different methods can be used
to approximate the quality. This paper discusses different corpus
analysis style ways to approximate the overall lexical quality of
the Finnish part of the Digi collection.
Using SMT for OCR Error Correction of Historical Texts
Haithem Afli, Zhengwei Qiu, Andy Way and Páraic Sheridan
A trend to digitize historical paper-based archives has emerged
in recent years, with the advent of digital optical scanners. A
lot of paper-based books, textbooks, magazines, articles, and
documents are being transformed into electronic versions that
can be manipulated by a computer. For this purpose, Optical
Character Recognition (OCR) systems have been developed
to transform scanned digital text into editable computer text.
However, different kinds of errors can be found in the OCR system
output text; Automatic Error Correction tools can help improve
the quality of electronic texts by cleaning them and removing
noise. In this paper, we perform a qualitative and
quantitative comparison of several error-correction techniques for
historical French documents. Experimentation shows that our
Machine Translation for Error Correction method is superior to
other Language Modelling correction techniques, with nearly 13%
relative improvement compared to the initial baseline.
OCR Post-Correction Evaluation of Early Dutch Books Online - Revisited
Martin Reynaert
We present further work on evaluation of the fully automatic
post-correction of Early Dutch Books Online, a collection of
10,333 18th century books. In prior work we evaluated the new
implementation of Text-Induced Corpus Clean-up (TICCL) on the
basis of a single book Gold Standard derived from this collection.
In the current paper we revisit the same collection on the basis of a
sizeable 1020 item random sample of OCR post-corrected strings
from the full collection. Both evaluations have their own stories
to tell and lessons to teach.
Crowdsourcing an OCR Gold Standard for a German and French Heritage Corpus
Simon Clematide, Lenz Furrer and Martin Volk
Crowdsourcing approaches for post-correction of OCR output
(Optical Character Recognition) have been successfully applied
to several historic text collections. We report on our crowd-
correction platform Kokos, which we built to improve the OCR
quality of the digitized yearbooks of the Swiss Alpine Club
(SAC) from the 19th century. This multilingual heritage corpus
consists of Alpine texts mainly written in German and French,
all typeset in Antiqua font. Finding and engaging volunteers for
correcting large amounts of pages into high quality text requires
a carefully designed user interface, an easy-to-use workflow, and
continuous efforts for keeping the participants motivated. More
than 180,000 characters on about 21,000 pages were corrected by
volunteers in about 7 months, achieving an OCR gold standard with
a systematically evaluated accuracy of 99.7% on the word level.
The crowdsourced OCR gold standard and the corresponding
original OCR recognition results from ABBYY FineReader 7 for
each page are available as a resource. Additionally, the scanned
images (300dpi) of all pages are included in order to facilitate tests
with other OCR software.
P10 - Discourse (1)
Wednesday, May 25, 16:45
Chairperson: Elena Cabrio Poster Session
Argument Mining: the Bottleneck of Knowledge and Language Resources
Patrick Saint-Dizier
Given a controversial issue, argument mining from natural
language texts (newspapers, and any form of text on the
Internet) is extremely challenging: domain knowledge is often
required together with appropriate forms of inferences to identify
arguments. This contribution explores the types of knowledge that
are required and how they can be paired with reasoning schemes,
language processing and language resources to accurately mine
arguments. We show via corpus analysis that the Generative
Lexicon, enhanced in different manners and viewed as both a
lexicon and a domain knowledge representation, is a relevant
approach. In this paper, corpus annotation for argument mining
is first developed, then we show how the generative lexicon
approach must be adapted and how it can be paired with language
processing patterns to extract and specify the nature of arguments.
Our approach to argument mining is thus knowledge-driven.
From Interoperable Annotations towards Interoperable Resources: A Multilingual Approach to the Analysis of Discourse
Ekaterina Lapshinova-Koltunski, Kerstin Anna Kunz and Anna Nedoluzhko
In the present paper, we analyse variation of discourse phenomena
in two typologically different languages, i.e. in German and
Czech. The novelty of our approach lies in the nature of
the resources we are using. Advantage is taken of existing
resources, which are, however, annotated on the basis of two
different frameworks. We use an interoperable scheme unifying
discourse phenomena in both frameworks into more abstract
categories and considering only those phenomena that have a
direct match in German and Czech. The discourse properties
we focus on are relations of identity, semantic similarity, ellipsis
and discourse relations. Our study shows that the application of
interoperable schemes allows an exploitation of discourse-related
phenomena analysed in different projects and on the basis of
different frameworks. As corpus compilation and annotation is
a time-consuming task, positive results of this experiment open up
new paths for contrastive linguistics, translation studies and NLP,
including machine translation.
Falling silent, lost for words ... Tracing personal involvement in interviews with Dutch war veterans
Henk van den Heuvel and Nelleke Oostdijk
In sources used in oral history research (such as interviews
with eye witnesses), passages where the degree of personal
emotional involvement is found to be high can be of particular
interest, as these may give insight into how historical events
were experienced, and what moral dilemmas and psychological
or religious struggles were encountered. In a pilot study involving
a large corpus of interview recordings with Dutch war veterans,
we have investigated if it is possible to develop a method for
automatically identifying those passages where the degree of
personal emotional involvement is high. The method is based
on the automatic detection of exceptionally large silences and
filled pause segments (using Automatic Speech Recognition), and
cues taken from specific n-grams. The first results appear to be
encouraging enough for further elaboration of the method.
A Bilingual Discourse Corpus and Its Applications
Yang Liu, Jiajun Zhang, Chengqing Zong, Yating Yang and Xi Zhou
Existing discourse research only focuses on the monolingual
languages and the inconsistency between languages limits the
power of the discourse theory in multilingual applications such as
machine translation. To address this issue, we design and build a
bilingual discourse corpus in which we are currently defining and
annotating the bilingual elementary discourse units (BEDUs). The
BEDUs are then organized into hierarchical structures. Using this
discourse style, we have annotated nearly 20K LDC sentences.
Finally, we design a bilingual discourse based method for machine
translation evaluation and show the effectiveness of our bilingual
discourse annotations.
Adding Semantic Relations to a Large-Coverage Connective Lexicon of German
Tatjana Scheffler and Manfred Stede
DiMLex is a lexicon of German connectives that can be used
for various language understanding purposes. We enhanced the
coverage to 275 connectives, which we regard as covering all
known German discourse connectives in current use. In this
paper, we consider the task of adding the semantic relations
that can be expressed by each connective. After discussing
different approaches to retrieving semantic information, we settle
on annotating each connective with senses from the new PDTB
3.0 sense hierarchy. We describe our new implementation in
the extended DiMLex, which will be available for research
purposes.
Corpus Resources for Dispute Mediation Discourse
Mathilde Janier and Chris Reed
Dispute mediation is a growing activity in the resolution of
conflicts, and more and more research emerge to enhance and
better understand this (until recently) understudied practice.
Corpus analyses are necessary to study discourse in this context;
yet, little data is available, mainly because of its confidentiality
principle. After proposing hints and avenues to acquire transcripts
of mediation sessions, this paper presents the Dispute Mediation
Corpus, which gathers annotated excerpts of mediation dialogues.
Although developed as part of a project on argumentation, it
is freely available and the text data can be used by anyone.
This first-ever open corpus of mediation interactions can be of
interest to scholars studying discourse, but also conflict resolution,
argumentation, linguistics, communication, etc. We advocate
for using and extending this resource that may be valuable to a
large variety of domains of research, particularly those striving
to enhance the study of the rapidly growing activity of dispute
mediation.
A Tagged Corpus for Automatic Labeling of Disabilities in Medical Scientific Papers
Carlos Valmaseda, Juan Martinez-Romo and Lourdes Araujo
This paper presents the creation of a corpus of labeled disabilities
in scientific papers. The identification of medical concepts in
documents and, especially, the identification of disabilities, is a
complex task mainly due to the variety of expressions that can
make reference to the same problem. Currently there is no set
of documents manually annotated with disabilities with which to
evaluate an automatic detection system of such concepts. This
is the reason why this corpus arises, aiming to facilitate the
evaluation of systems that implement an automatic annotation tool
for extracting biomedical concepts such as disabilities. The result
is a set of scientific papers manually annotated. For the selection
of these scientific papers has been conducted a search using a
list of rare diseases, since they generally have associated several
disabilities of different kinds.
PersonaBank: A Corpus of Personal Narratives and Their Story Intention Graphs
Stephanie Lukin, Kevin Bowden, Casey Barackman and Marilyn Walker
We present a new corpus, PersonaBank, consisting of 108 personal
stories from weblogs that have been annotated with their Story
Intention Graphs, a deep representation of the content of a
story. We describe the topics of the stories and the basis of
the Story Intention Graph representation, as well as the process
of annotating the stories to produce the Story Intention Graphs
and the challenges of adapting the tool to this new personal
narrative domain. We also discuss how the corpus can be used in
applications that retell the story using different styles of tellings,
co-tellings, or as a content planner.
Fine-Grained Chinese Discourse Relation Labelling
Huan-Yuan Chen, Wan-Shan Liao, Hen-Hsen Huang and Hsin-Hsi Chen
This paper explores several aspects together for a fine-grained
Chinese discourse analysis. We deal with the issues of ambiguous
discourse markers, ambiguous marker linkings, and more than
one discourse marker. A universal feature representation is
proposed. The pair-once postulation, cross-discourse-unit-first
rule and word-pair-marker-first rule select a set of discourse
markers from ambiguous linkings. Marker-Sum feature considers
total contribution of markers and Marker-Preference feature
captures the probability distribution of discourse functions of a
representative marker by using preference rule. The HIT Chinese
discourse relation treebank (HIT-CDTB) is used to evaluate the
proposed models. The 25-way classifier achieves 0.57 micro-
averaged F-score.
Annotating Discourse Relations in Spoken Language: A Comparison of the PDTB and CCR Frameworks
Ines Rehbein, Merel Scholman and Vera Demberg
In discourse relation annotation, there is currently a variety of
different frameworks being used, and most of them have been
developed and employed mostly on written data. This raises
a number of questions regarding interoperability of discourse
relation annotation schemes, as well as regarding differences in
discourse annotation for written vs. spoken domains. In this
paper, we describe our work on annotating two spoken domains from
the SPICE Ireland corpus (telephone conversations and broadcast
interviews) according to two different discourse annotation schemes,
PDTB 3.0 and CCR. We show that annotations in the two schemes
can largely be mapped onto one another, and discuss differences in
operationalisations of discourse relation schemes which present
a challenge to automatic mapping. We also observe systematic
differences in the prevalence of implicit discourse relations in
spoken data compared to written texts, and find that there are also
differences in the types of causal relations between the domains.
Finally, we find that PDTB 3.0 addresses many shortcomings of
PDTB 2.0 with respect to the annotation of spoken discourse, and
suggest further extensions. The new corpus is roughly the size of the
CoNLL 2015 Shared Task test set, and we hence hope that it will be
a valuable resource for the evaluation of automatic discourse
relation labellers.
Enhancing The RATP-DECODA Corpus With Linguistic Annotations For Performing A Large Range Of NLP Tasks
Carole Lailler, Anaïs Landeau, Frédéric Béchet, Yannick Estève and Paul Deléglise
In this article, we present the RATP-DECODA Corpus,
which is composed of 67 hours of speech from telephone
conversations of a Customer Care Service (CCS).
This corpus is already available online at
http://sldr.org/sldr000847/fr in its first version. However, many
enhancements have been made in order to allow the development
of automatic techniques to transcribe conversations and to capture
their meaning. These enhancements fall into two categories:
firstly, we have increased the size of the corpus with manual
transcriptions from a new operational day; secondly we have
added new linguistic annotations to the whole corpus (either
manually or through an automatic processing) in order to perform
various linguistic tasks from syntactic and semantic parsing to
dialog act tagging and dialog summarization.
Parallel Discourse Annotations on a Corpus of Short Texts
Manfred Stede, Stergos Afantenos, Andreas Peldszus, Nicholas Asher and Jérémy Perret
We present the first corpus of texts annotated with two
alternative approaches to discourse structure, Rhetorical Structure
Theory (Mann and Thompson, 1988) and Segmented Discourse
Representation Theory (Asher and Lascarides, 2003). 112
short argumentative texts have been analyzed according to these
two theories. Furthermore, in previous work, the same texts
have already been annotated for their argumentation structure,
according to the scheme of Peldszus and Stede (2013). This
corpus therefore enables studies of correlations between the
two accounts of discourse structure, and between discourse and
argumentation. We converted the three annotation formats to
a common dependency tree format that makes it possible to compare
the structures, and we describe some initial findings.
An Annotated Corpus of Direct Speech
John Lee and Chak Yan Yeung
We propose a scheme for annotating direct speech in
literary texts, based on the Text Encoding Initiative (TEI)
and the coreference annotation guidelines from the Message
Understanding Conference (MUC). The scheme encodes the
speakers and listeners of utterances in a text, as well as the
quotative verbs that report the utterances. We measure
inter-annotator agreement on this annotation task. We then present
statistics on a manually annotated corpus that consists of books
from the New Testament. Finally, we visualize the corpus as a
conversational network.
P11 - Morphology (1) Wednesday, May 25, 16:45
Chairperson: Éric de la Clergerie Poster Session
Evaluating the Noisy Channel Model for the Normalization of Historical Texts: Basque, Spanish and Slovene
Izaskun Etxeberria, Iñaki Alegria, Larraitz Uria and Mans Hulden
This paper presents a method for the normalization of historical
texts using a combination of weighted finite-state transducers and
language models. We have extended our previous work on the
normalization of dialectal texts and tested the method against a
17th century literary work in Basque. This preprocessed corpus
is made available in the LREC repository. The performance
of this method for learning relations between historical and
contemporary word forms is evaluated against resources in three
languages. The method we present learns to map phonological
changes using a noisy channel model. The model is based
on techniques commonly used for phonological inference and for
producing grapheme-to-grapheme conversion systems encoded
as weighted transducers, and it produces F-scores above 80% on the
task for Basque. A wider evaluation shows that the approach
performs equally well with all the languages in our evaluation
suite: Basque, Spanish and Slovene. A comparison against other
methods that address the same task is also provided.
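The noisy channel decision rule underlying this kind of normalization can be sketched in a few lines: among candidate modern forms m, pick the one maximizing P(historical | m) · P(m). The probabilities and word forms below are toy values for illustration only; the paper learns these distributions with weighted finite-state transducers and a language model.

```python
# Toy noisy-channel normalizer. All probabilities are invented for
# illustration; they stand in for the learned channel model and
# language model of a real system.
channel = {  # P(historical_form | modern_form)
    ("euscara", "euskara"): 0.6,
    ("euscara", "euscera"): 0.3,
}
lang_model = {"euskara": 0.01, "euscera": 0.0001}  # P(modern_form)

def normalize(historical, candidates):
    """Return the candidate modern form with the highest channel * LM score."""
    return max(candidates,
               key=lambda m: channel.get((historical, m), 1e-12)
                             * lang_model.get(m, 1e-12))

print(normalize("euscara", ["euskara", "euscera"]))  # euskara
```

The product of the two probabilities is what makes the model a "noisy channel": the channel term explains the historical spelling, while the language model term favours plausible modern words.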
Farasa: A New Fast and Accurate Arabic Word Segmenter
Kareem Darwish and Hamdy Mubarak
In this paper, we present Farasa (meaning insight in Arabic),
which is a fast and accurate Arabic segmenter. Segmentation
involves breaking Arabic words into their constituent clitics.
Our approach is based on SVMrank using linear kernels. The
features that we utilized account for: likelihood of stems,
prefixes, suffixes, and their combination; presence in lexicons
containing valid stems and named entities; and underlying stem
templates. Farasa matches or outperforms state-of-the-art Arabic
segmenters, namely QATARA and MADAMIRA. Meanwhile,
Farasa is nearly one order of magnitude faster than QATARA
and two orders of magnitude faster than MADAMIRA. The
segmenter should be able to process one billion words in less than
5 hours. Farasa is written entirely in native Java, with no external
dependencies, and is open-source.
A Morphological Lexicon of Esperanto with Morpheme Frequencies
Eckhard Bick
This paper discusses the internal structure of complex Esperanto
words (CWs). Using a morphological analyzer, possible affixation
and compounding is checked for over 50,000 Esperanto lexemes
against a list of 17,000 root words. Morpheme boundaries
in the resulting analyses were then checked manually, creating
a CW dictionary of 28,000 words, representing 56.4% of the
lexicon, or 19.4% of corpus tokens. The error percentage of
the EspGram morphological analyzer for new corpus CWs was
4.3% for types and 6.4% for tokens, with a recall of almost
100%, and wrong/spurious boundaries being more common
than missing ones. For pedagogical purposes a morpheme
frequency dictionary was constructed for a 16 million word
corpus, confirming the importance of agglutinative derivational
morphemes in the Esperanto lexicon. Finally, as a means to reduce
the morphological ambiguity of CWs, we provide POS likelihoods
for Esperanto suffixes.
How does Dictionary Size Influence Performance of Vietnamese Word Segmentation?
Wuying Liu and Lin Wang
Vietnamese word segmentation (VWS) is a challenging basic
issue for natural language processing. This paper addresses the
problem of how dictionary size influences VWS performance,
proposes two novel measures: square overlap ratio (SOR)
and relaxed square overlap ratio (RSOR), and validates their
effectiveness. The SOR measure is the product of dictionary
overlap ratio and corpus overlap ratio, and the RSOR measure
is the relaxed version of SOR measure under an unsupervised
condition. Both measures indicate how well a segmentation
dictionary suits the target corpus to be segmented. The
experimental results show that a suitably sized dictionary, neither
too small nor too large, is best for achieving state-of-the-art
performance with dictionary-based Vietnamese word segmenters.
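Reading the SOR measure as the product of two set-overlap ratios, a minimal sketch could look as follows. The exact definitions of the two ratios are assumptions inferred from the abstract's description, and the word sets are toy data, not from the paper.

```python
def sor(dictionary, corpus_vocab):
    """Square overlap ratio (assumed reading): the product of the
    dictionary overlap ratio (shared entries / dictionary size) and
    the corpus overlap ratio (shared entries / corpus vocabulary size)."""
    shared = dictionary & corpus_vocab
    return (len(shared) / len(dictionary)) * (len(shared) / len(corpus_vocab))

dictionary = {"anh", "em", "nha", "truong"}             # toy dictionary
corpus_vocab = {"anh", "em", "nha", "di", "hoc", "ve"}  # toy corpus vocabulary
print(sor(dictionary, corpus_vocab))  # 3/4 * 3/6 = 0.375
```

Because both factors lie in [0, 1], SOR peaks when the dictionary is neither much smaller nor much larger than the corpus vocabulary, matching the abstract's "neither too small nor too large" finding.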
Giving Lexical Resources a Second Life: Démonette, a Multi-sourced Morpho-semantic Network for French
Nabil Hathout and Fiammetta Namer
Démonette is a derivational morphological network designed for
the description of French. Its original architecture enables its
use as a formal framework for the description of morphological
analyses and as a repository for existing lexicons. It is fed with
a variety of resources, all of which have already been validated. The
harmonization of their content into a unified format gives them
a second life, in which they are enriched with new properties,
provided these can be deduced from their contents. Démonette
is released under a Creative Commons license. It is usable for
theoretical and descriptive research in morphology, as a source
of experimental material for psycholinguistics, natural language
processing (NLP) and information retrieval (IR), where it fills a
gap, since French lacks a large-coverage derivational
resource. The article presents the integration of two existing
lexicons into Démonette. The first is Verbaction, a lexicon of
deverbal action nouns. The second is Lexeur, a database of agent
nouns in -eur derived from verbs or from nouns.
Syntactic Analysis of Phrasal Compounds in Corpora: a Challenge for NLP Tools
Carola Trips
The paper introduces a “train once, use many” approach for the
syntactic analysis of phrasal compounds (PC) of the type XP+N
like “Would you like to sit on my knee?” nonsense. PCs are a
challenge for NLP tools since they require the identification of a
syntactic phrase within a morphological complex. We propose
a method which uses a state-of-the-art dependency parser not
only to analyse sentences (the environment of PCs) but also
to parse the non-head of PCs under a well-defined condition:
the analysis of the non-head spans from the left boundary
(mostly marked by a determiner) to the nominal head of the
PC. The method comprises the following steps: (a) the use of a
state-of-the-art English dependency parser on data comprising
sentences with PCs from the British National Corpus
(BNC), (b) the detection of parsing errors of PCs, (c) the separate
treatment of the non-head structure using the same model, and
(d) the attachment of the non-head to the compound head. The
evaluation of the method showed that the accuracy of 76% could
be improved by adding a step in the PC compounder module
which specifies user-defined contexts sensitive to the part
of speech of the non-head parts and by using TreeTagger, in line
with our approach.
DALILA: The Dialectal Arabic Linguistic Learning Assistant
Salam Khalifa, Houda Bouamor and Nizar Habash
Dialectal Arabic (DA) poses serious challenges for Natural
Language Processing (NLP). The number and sophistication of
tools and datasets in DA are very limited in comparison to
Modern Standard Arabic (MSA) and other languages. MSA
tools do not effectively model DA, which makes the direct use
of MSA NLP tools for handling dialects impractical. This
is particularly a challenge for the creation of tools to support
learning Arabic as a living language on the web, where authentic
material can be found in both MSA and DA. In this paper,
we present the Dialectal Arabic Linguistic Learning Assistant
(DALILA), a Chrome extension that utilizes cutting-edge Arabic
dialect NLP research to assist learners and non-native speakers
in understanding text written in either MSA or DA. DALILA
provides dialectal word analysis and English gloss corresponding
to each word.
Refurbishing a Morphological Database for German
Petra Steiner
The CELEX database is one of the standard lexical resources for
German. It yields a wealth of data especially for phonological and
morphological applications. The morphological part comprises
deep-structure morphological analyses of German. However, as
it was developed in the Nineties, both encoding and spelling
are outdated. About one fifth of its over 50,000 entries contain
umlauts and signs such as ß. A modern version cannot be
obtained by simple substitution. In this paper, we
briefly describe the original content and form of the orthographic
and morphological database for German in CELEX. Then we
present our work on modernizing the linguistic data. Lemmas
and morphological analyses are transferred to a modern standard
of encoding by first merging orthographic and morphological
information of the lemmas and their entries and then performing
a second substitution for the morphs within their morphological
analyses. Changes to modern German spelling are performed
by substitution rules according to orthographical standards. We
show an example of the use of the data for the disambiguation of
morphological structures. The discussion describes prospects of
future work on this or similar lexicons. The Perl script is publicly
available on our website.
P12 - Sentiment Analysis and Opinion Mining (1) Wednesday, May 25, 16:45
Chairperson: German Rigau Poster Session
Encoding Adjective Scales for Fine-grained Resources
Cédric Lopez, Frederique Segond and Christiane Fellbaum
We propose an automatic approach towards determining the
relative location of adjectives on a common scale based on their
strength. We focus on adjectives expressing different degrees
of goodness occurring in French product (perfumes) reviews.
Using morphosyntactic patterns, we extract from the reviews short
phrases consisting of a noun that encodes a particular aspect of
the perfume and an adjective modifying that noun. We then
associate each such n-gram with the corresponding product aspect
and its related star rating. Next, based on the star scores, we
generate adjective scales reflecting the relative strength of specific
adjectives associated with a shared attribute of the product. An
automatic ordering of the adjectives “correct” (correct), “sympa”
(nice), “bon” (good) and “excellent” (excellent) according to their
score in our resource is consistent with an intuitive scale based on
human judgments. Our long-term objective is to generate different
adjective scales in an empirical manner, which could allow the
enrichment of lexical resources.
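The scale-building step described above, ordering adjectives by the star scores of the reviews they occur in, can be approximated as follows. Aggregating by the mean star rating is an assumption for illustration; the observations below are toy data, not the paper's perfume reviews.

```python
from collections import defaultdict

def adjective_scale(observations):
    """Order adjectives by the mean star rating of the reviews in which
    they modify an aspect noun (mean aggregation is assumed here).
    observations: iterable of (adjective, star_rating) pairs."""
    stars = defaultdict(list)
    for adjective, rating in observations:
        stars[adjective].append(rating)
    means = {adj: sum(r) / len(r) for adj, r in stars.items()}
    return sorted(means, key=means.get)  # weakest to strongest

obs = [("correct", 2), ("correct", 3), ("sympa", 3), ("sympa", 4),
       ("bon", 4), ("excellent", 5)]
print(adjective_scale(obs))  # ['correct', 'sympa', 'bon', 'excellent']
```

With these toy ratings the induced ordering reproduces the intuitive "correct" < "sympa" < "bon" < "excellent" scale mentioned in the abstract.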
SCARE – The Sentiment Corpus of App Reviews with Fine-grained Annotations in German
Mario Sänger, Ulf Leser, Steffen Kemmerer, Peter Adolphs and Roman Klinger
The automatic analysis of texts containing opinions of users about,
e.g., products or political views has gained attention within the
last decades. However, previous work on the task of analyzing
user reviews about mobile applications in app stores is limited.
Publicly available corpora do not exist, making a comparison
of different methods and models difficult. We fill this gap by
contributing the Sentiment Corpus of App Reviews (SCARE),
which contains fine-grained annotations of application aspects,
subjective (evaluative) phrases and relations between both. This
corpus consists of 1,760 annotated application reviews from
the Google Play Store with 2,487 aspects and 3,959 subjective
phrases. We describe the process and methodology used to create
the corpus. Fleiss' kappa between the four annotators
shows an agreement of 0.72. We provide a strong baseline
with a linear-chain conditional random field and word-embedding
features with a performance of 0.62 for aspect detection and 0.63
for the extraction of subjective phrases. The corpus is available to
the research community to support the development of sentiment
analysis methods on mobile application reviews.
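The agreement figure reported above is Fleiss' kappa, whose standard computation can be sketched as follows. The count matrix below is a toy example, not the SCARE annotation data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of annotators who assigned
    category j to item i; every item must have the same number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    # mean observed per-item agreement
    p_bar = sum((sum(c * c for c in row) - n_raters)
                / (n_raters * (n_raters - 1)) for row in counts) / n_items
    # chance agreement from overall category proportions
    n_cats = len(counts[0])
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Four annotators, two categories, perfect agreement on two items:
print(fleiss_kappa([[4, 0], [0, 4]]))  # 1.0
```

Kappa corrects the raw agreement p_bar for the agreement p_e expected by chance, which is why a value of 0.72 over four annotators indicates substantial agreement.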
Datasets for Aspect-Based Sentiment Analysis in French
Marianna Apidianaki, Xavier Tannier and Cécile Richart
Aspect Based Sentiment Analysis (ABSA) is the task of mining
and summarizing opinions from text about specific entities
and their aspects. This article describes two datasets for the
development and testing of ABSA systems for French which
comprise user reviews annotated with relevant entities, aspects and
polarity values. The first dataset contains 457 restaurant reviews
(2365 sentences) for training and testing ABSA systems, while the
second contains 162 museum reviews (655 sentences) dedicated
to out-of-domain evaluation. Both datasets were built as part of
SemEval-2016 Task 5 “Aspect-Based Sentiment Analysis”, where
seven different languages were represented, and are publicly
available for research purposes.
ANEW+: Automatic Expansion and Validation of Affective Norms of Words Lexicons in Multiple Languages
Samira Shaikh, Kit Cho, Tomek Strzalkowski, Laurie Feldman, John Lien, Ting Liu and George Aaron Broadwell
In this article we describe our method of automatically expanding
an existing lexicon of words with affective valence scores. The
automatic expansion process was done in English. In addition,
we describe our procedure for automatically creating lexicons in
languages where such resources may not previously exist. The
foreign languages we discuss in this paper are Spanish, Russian
and Farsi. We also describe the procedures to systematically
validate our newly created resources. The main contributions of
this work are: 1) A general method for expansion and creation of
lexicons with scores of words on psychological constructs such as
valence, arousal or dominance; and 2) a procedure for ensuring
validity of the newly constructed resources.
PotTS: The Potsdam Twitter Sentiment Corpus
Uladzimir Sidarenka
In this paper, we introduce a novel comprehensive dataset of 7,992
German tweets, which were manually annotated by two human
experts with fine-grained opinion relations. A rich annotation
scheme used for this corpus includes such sentiment-relevant
elements as opinion spans, their respective sources and targets,
emotionally laden terms with their possible contextual negations
and modifiers. Various inter-annotator agreement studies, which
were carried out at different stages of work on these data (at the
initial training phase, upon an adjudication step, and after the
final annotation run), reveal that labeling evaluative judgements
in microblogs is an inherently difficult task even for professional
coders. These difficulties, however, can be alleviated by letting
the annotators revise each other’s decisions. Once rechecked,
the experts can proceed with the annotation of further messages,
staying at a fairly high level of agreement.
Challenges of Evaluating Sentiment Analysis Tools on Social Media
Diana Maynard and Kalina Bontcheva
This paper discusses the challenges in carrying out fair
comparative evaluations of sentiment analysis systems. Firstly,
these are due to differences in corpus annotation guidelines
and sentiment class distribution. Secondly, different systems
often make different assumptions about how to interpret certain
statements, e.g. tweets with URLs. In order to study the impact of
these on evaluation results, this paper focuses on tweet sentiment
analysis in particular. One existing and two newly created corpora
are used, and the performance of four different sentiment analysis
systems is reported; we make our annotated datasets and sentiment
analysis applications publicly available. We see considerable
variations in results across the different corpora, which calls
into question the validity of many existing annotated datasets
and evaluations, and we make some observations about both the
systems and the datasets as a result.
EmoTweet-28: A Fine-Grained Emotion Corpus for Sentiment Analysis
Jasy Suet Yan Liew, Howard R. Turtle and Elizabeth D. Liddy
This paper describes EmoTweet-28, a carefully curated corpus
of 15,553 tweets annotated with 28 emotion categories for the
purpose of training and evaluating machine learning models for
emotion classification. EmoTweet-28 is, to date, the largest
tweet corpus annotated with fine-grained emotion categories.
The corpus contains annotations for four facets of emotion:
valence, arousal, emotion category and emotion cues. We first
used small-scale content analysis to inductively identify a set of
emotion categories that characterize the emotions expressed in
microblog text. We then expanded the size of the corpus using
crowdsourcing. The corpus encompasses a variety of examples
including explicit and implicit expressions of emotions as well as
tweets containing multiple emotions. EmoTweet-28 represents an
important resource to advance the development and evaluation of
more emotion-sensitive systems.
Happy Accident: A Sentiment Composition Lexicon for Opposing Polarity Phrases
Svetlana Kiritchenko and Saif Mohammad
Sentiment composition is the determination of the sentiment of a
multi-word linguistic unit, such as a phrase or a sentence, based on
its constituents. We focus on sentiment composition in phrases
formed by at least one positive and at least one negative word —
phrases like ’happy accident’ and ’best winter break’. We refer to
such phrases as opposing polarity phrases. We manually annotate
a collection of opposing polarity phrases and their constituent
single words with real-valued sentiment intensity scores using a
method known as Best–Worst Scaling. We show that the obtained
annotations are consistent. We explore the entries in the lexicon
for linguistic regularities that govern sentiment composition in
opposing polarity phrases. Finally, we list the current and possible
future applications of the lexicon.
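Best–Worst Scaling judgements like those described above are conventionally turned into real-valued scores by counting: score(w) = (#times chosen best − #times chosen worst) / #appearances. A sketch with invented judgement tuples:

```python
from collections import Counter

def bws_scores(judgements):
    """Turn Best-Worst Scaling judgements into scores in [-1, 1].
    Each judgement is (items_shown, chosen_best, chosen_worst)."""
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in judgements:
        best[b] += 1
        worst[w] += 1
        seen.update(items)
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}

judgements = [
    (("happy accident", "tragic loss", "nice surprise"),
     "nice surprise", "tragic loss"),
    (("happy accident", "tragic loss", "best winter break"),
     "best winter break", "tragic loss"),
]
print(bws_scores(judgements)["tragic loss"])  # -1.0
```

Because each item appears in several tuples, the counting procedure yields fine-grained, reliably ordered intensity scores from simple forced choices, which is the appeal of the method.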
Detecting Implicit Expressions of Affect from Text using Semantic Knowledge on Common Concept Properties
Alexandra Balahur and Hristo Tanev
Emotions are an important part of the human experience. They are
responsible for the adaptation and integration in the environment,
offering, most of the time together with the cognitive system,
the appropriate responses to stimuli in the environment. As
such, they are an important component in decision-making
processes. In today’s society, the avalanche of stimuli present in
the environment (physical or virtual) makes people more prone to
respond to stronger affective stimuli (i.e., those that are related
to their basic needs and motivations – survival, food, shelter,
etc.). In media reporting, this translates into the use of arguments
(factual data) that are known to trigger specific (strong, affective)
behavioural reactions from the readers. This paper describes
initial efforts to detect such arguments from text, based on the
properties of concepts. The final system, able to retrieve and
label this type of data from news in traditional and social media
platforms, is intended to be integrated into the Europe Media Monitor
family of applications to detect texts that trigger certain (especially
negative) reactions from the public, with consequences on citizen
safety and security.
Creating a General Russian Sentiment Lexicon
Natalia Loukachevitch and Anatolii Levchik
The paper describes the new Russian sentiment lexicon,
RuSentiLex. The lexicon was gathered from several sources:
opinionated words from domain-oriented Russian sentiment
vocabularies, slang and curse words extracted from Twitter,
objective words with positive or negative connotations from a
news collection. The words in the lexicon having different
sentiment orientations in specific senses are linked to appropriate
concepts of the thesaurus of Russian language RuThes. All
lexicon entries are classified according to four sentiment
categories and three sources of sentiment (opinion, emotion,
or fact). The lexicon can serve as the first version for the
construction of domain-specific sentiment lexicons or can be used
for feature generation in machine-learning approaches. In this
role, the RuSentiLex lexicon was utilized by the participants of
the SentiRuEval-2016 Twitter reputation monitoring shared task
and allowed them to achieve high results.
GRaSP: A Multilayered Annotation Scheme for Perspectives
Chantal van Son, Tommaso Caselli, Antske Fokkens, Isa Maks, Roser Morante, Lora Aroyo and Piek Vossen
This paper presents a framework and methodology for the
annotation of perspectives in text. In the last decade, different
aspects of linguistic encoding of perspectives have been targeted
as separated phenomena through different annotation initiatives.
We propose an annotation scheme that integrates these different
phenomena. We use a multilayered annotation approach, splitting
the annotation of different aspects of perspectives into small
subsequent subtasks in order to reduce the complexity of the
task and to better monitor interactions between layers. Currently,
we have included four layers of perspective annotation: events,
attribution, factuality and opinion. The annotations are integrated
in a formal model called GRaSP, which provides the means to
represent instances (e.g. events, entities) and propositions in the
(real or assumed) world in relation to their mentions in text. Then,
the relation between the source and target of a perspective is
characterized by means of perspective annotations. This enables
us to place alternative perspectives on the same entity, event or
proposition next to each other.
Integration of Lexical and Semantic Knowledge for Sentiment Analysis in SMS
Wejdene Khiari, Mathieu Roche and Asma Bouhafs Hafsia
With the explosive growth of online social media (forums, blogs,
and social networks), exploitation of these new information
sources has become essential. Our work is based on the
sud4science project. The goal of this project is to perform
multidisciplinary work on a corpus of authentic SMS, in French,
collected in 2011 and anonymised (88milSMS corpus:
http://88milsms.huma-num.fr). This paper highlights a new
method to integrate opinion detection knowledge from an SMS
corpus by combining lexical and semantic information. More
precisely, our approach gives more weight to words with a
sentiment (i.e. presence of words in a dedicated dictionary) for
a classification task based on three classes: positive, negative,
and neutral. The experiments were conducted on two corpora: an
elongated SMS corpus (i.e. repetitions of characters in messages)
and a non-elongated SMS corpus. We noted that non-elongated
SMS were much better classified than elongated SMS. Overall,
this study highlighted that the integration of semantic knowledge
always improves classification.
Specialising Paragraph Vectors for Text Polarity Detection
Fabio Tamburini
This paper presents some experiments for specialising Paragraph
Vectors, a new technique for creating text fragment (phrase,
sentence, paragraph, text, ...) embedding vectors, for text polarity
detection. The first extension injects polarity information
extracted from a polarity lexicon into the embeddings, and
the second inserts word order information into Paragraph
Vectors. These two extensions, when training a
logistic-regression classifier on the combined embeddings, were
able to produce a relevant gain in performance when compared
to the standard Paragraph Vector methods proposed by Le and
Mikolov (2014).
Evaluating Lexical Similarity to build Sentiment Similarity
Grégoire Jadi, Vincent Claveau, Béatrice Daille and Laura Monceaux
In this article, we propose to evaluate the lexical similarity
information provided by word representations against several
opinion resources using traditional Information Retrieval tools.
Word representations have been used to build and to extend
opinion resources such as lexicons and ontologies, and their
performance has been evaluated on sentiment analysis tasks.
We question this method by measuring the correlation between
the sentiment proximity provided by opinion resources and
the semantic similarity provided by word representations using
different correlation coefficients. We also compare the neighbors
found in word representations and list of similar opinion words.
Our results show that the proximity of words in state-of-the-art
word representations is not very effective for building sentiment
similarity.
P13 - Semantics (1) Wednesday, May 25, 16:45
Chairperson: Christian Chiarcos Poster Session
Visualisation and Exploration of High-Dimensional Distributional Features in Lexical Semantic Classification
Maximilian Köper, Melanie Zaiß, Qi Han, Steffen Koch and Sabine Schulte im Walde
Vector space models and distributional information are widely
used in NLP. The models typically rely on complex, high-
dimensional objects. We present an interactive visualisation tool
to explore salient lexical-semantic features of high-dimensional
word objects and word similarities. Most visualisation tools
provide only one low-dimensional map of the underlying data, so
they are not capable of retaining the local and the global structure.
We overcome this limitation by providing an additional trust-view
to obtain a more realistic picture of the actual object distances.
Additional tool options include the reference to a gold standard
classification, the reference to a cluster analysis as well as listing
the most salient (common) features for a selected subset of the
words.
SemAligner: A Method and Tool for Aligning Chunks with Semantic Relation Types and Semantic Similarity Scores
Nabin Maharjan, Rajendra Banjade, Nobal Bikram Niraula and Vasile Rus
This paper introduces a rule-based method and software tool,
called SemAligner, for aligning chunks across texts in a given
pair of short English texts. The tool, based on the top
performing method at the Interpretable Short Text Similarity
shared task at SemEval 2015, where it was used with human
annotated (gold) chunks, can now additionally process plain
text pairs using two powerful chunkers we developed, e.g. one
based on Conditional Random Fields. Besides aligning chunks, the tool
automatically assigns semantic relations to the aligned chunks
(such as EQUI for equivalent and OPPO for opposite) and
semantic similarity scores that measure the strength of the
semantic relation between the aligned chunks. Experiments show
that SemAligner performs competitively for system generated
chunks and that these results are also comparable to results
obtained on gold chunks. SemAligner has other capabilities
such as handling various input formats and chunkers as well as
extending lookup resources.
Aspectual Flexibility Increases with Agentivity and Concreteness: A Computational Classification Experiment on Polysemous Verbs
Ingrid Falk and Fabienne Martin
We present an experimental study making use of a machine
learning approach to identify the factors that affect the aspectual
value that characterizes verbs under each of their readings. The
study is based on various morpho-syntactic and semantic features
collected from a French lexical resource and on a gold standard
aspectual classification of verb readings designed by an expert.
Our results support the tested hypothesis, namely that agentivity
and abstractness influence lexical aspect.
mwetoolkit+sem: Integrating Word Embeddings in the mwetoolkit for Semantic MWE Processing
Silvio Cordeiro, Carlos Ramisch and Aline Villavicencio
This paper presents mwetoolkit+sem: an extension of the
mwetoolkit that estimates semantic compositionality scores for
multiword expressions (MWEs) based on word embeddings.
First, we describe our implementation of vector-space operations
working on distributional vectors. The compositionality score is
based on the cosine distance between the MWE vector and the
composition of the vectors of its member words. Our generic
system can handle several types of word embeddings and MWE
lists, and may combine individual word representations using
several composition techniques. We evaluate our implementation
on a dataset of 1042 English noun compounds, comparing
different configurations of the underlying word embeddings and
word-composition models. We show that our vector-based scores
model non-compositionality better than standard association
measures such as log-likelihood.
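The compositionality score described above, the cosine between the observed MWE vector and a composition of its member-word vectors, can be sketched with additive composition (one of several composition functions the toolkit supports). The three-dimensional vectors below are toy values, not trained embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def compositionality(mwe_vec, member_vecs):
    """Cosine between the corpus-observed MWE vector and the additive
    composition of its members' vectors; higher = more compositional."""
    composed = [sum(dims) for dims in zip(*member_vecs)]
    return cosine(mwe_vec, composed)

# A compound whose observed vector matches the sum of its parts scores
# high; an idiomatic one whose vector points elsewhere scores low.
print(compositionality([1.0, 1.0, 0.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))
print(compositionality([0.0, 0.0, 1.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))
```

Swapping the additive composition for multiplicative or weighted variants only changes the `composed` line, which is the kind of configurability the abstract attributes to the toolkit.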
Cognitively Motivated Distributional Representations of Meaning
Elias Iosif, Spiros Georgiladakis and Alexandros Potamianos
Although meaning is at the core of human cognition, state-of-
the-art distributional semantic models (DSMs) are often agnostic
to the findings in the area of semantic cognition. In this
work, we present a novel type of DSMs motivated by the
dual-processing cognitive perspective that is triggered by
lexico-semantic activations in short-term human memory. The
proposed model is shown to perform better than state-of-the-art
models for computing semantic similarity between words. The
fusion of different types of DSMs is also investigated achieving
results that are comparable to or better than the state of the art. The
used corpora along with a set of tools, as well as large repositories
of vectorial word representations are made publicly available for
four languages (English, German, Italian, and Greek).
Extending Monolingual Semantic Textual Similarity Task to Multiple Cross-lingual Settings
Yoshihiko Hayashi and Wentao Luo
This paper describes our independent effort for extending the
monolingual semantic textual similarity (STS) task setting to
multiple cross-lingual settings involving English, Japanese, and
Chinese. So far, we have adopted a “monolingual similarity after
translation” strategy to predict the semantic similarity between
a pair of sentences in different languages. With this strategy, a
monolingual similarity method is applied after having (one of) the
target sentences translated into a pivot language. Therefore, this
paper specifically details the required and developed resources to
implement this framework, while presenting our current results
for English-Japanese-Chinese cross-lingual STS tasks that may
exemplify the validity of the framework.
Resources for building applications with Dependency Minimal Recursion Semantics
Ann Copestake, Guy Emerson, Michael Wayne Goodman, Matic Horvat, Alexander Kuhnle and Ewa Muszynska
We describe resources aimed at increasing the usability of the
semantic representations utilized within the DELPH-IN (Deep
Linguistic Processing with HPSG) consortium. We concentrate
in particular on the Dependency Minimal Recursion Semantics
(DMRS) formalism, a graph-based representation designed for
compositional semantic representation with deep grammars. Our
main focus is on English, and specifically English Resource
Semantics (ERS) as used in the English Resource Grammar.
We first give an introduction to ERS and DMRS and a brief
overview of some existing resources and then describe in detail
a new repository which has been developed to simplify the use of
ERS/DMRS. We explain a number of operations on DMRS graphs
which our repository supports, with sketches of the algorithms,
and illustrate how these operations can be exploited in application
building. We believe that this work will help researchers exploit
the rich and effective but complex DELPH-IN resources.
Subtask Mining from Search Query Logs for How-Knowledge Acceleration
Chung-Lun Kuo and Hsin-Hsi Chen
How-knowledge is indispensable in daily life, but it is scarcer
and of poorer quality than what-knowledge in publicly
available knowledge bases. This paper first extracts task-subtask
pairs from wikiHow, then mines linguistic patterns from search
query logs, and finally applies the mined patterns to extract
subtasks to complete given how-to tasks. To evaluate the
proposed methodology, we group tasks and the corresponding
recommended subtasks into pairs, and evaluate the results
automatically and manually. The automatic evaluation shows the
accuracy of 0.4494. We also classify the mined patterns based
on prepositions and find that prepositions such as “on”, “to”, and
“with” perform best. The results can be used to
accelerate how-knowledge base construction.
Typology of Adjectives Benchmark for Compositional Distributional Models
Daria Ryzhova, Maria Kyuseva and Denis Paperno
In this paper we present a novel application of compositional
distributional semantic models (CDSMs): prediction of lexical
typology. The paper introduces the notion of typological
closeness, which is a novel rigorous formalization of semantic
similarity based on comparison of multilingual data. Starting
from the Moscow Database of Qualitative Features for adjective
typology, we create four datasets of typological closeness, on
which we test a range of distributional semantic models. We
show that, on the one hand, vector representations of phrases
based on data from one language can be used to predict how
words within the phrase translate into different languages, and,
on the other hand, that typological data can serve as a semantic
benchmark for distributional models. We find that compositional
distributional models, especially parametric ones, perform well
above non-compositional alternatives on the task.
DART: a Dataset of Arguments and their Relations on Twitter
Tom Bosc, Elena Cabrio and Serena Villata
The problem of understanding the stream of messages exchanged
on social media such as Facebook and Twitter is becoming a
major challenge for automated systems. The tremendous amount
of data exchanged on these platforms as well as the specific
form of language adopted by social media users constitute a new
challenging context for existing argument mining techniques. In
this paper, we describe a resource of natural language arguments
called DART (Dataset of Arguments and their Relations on
Twitter) where the complete argument mining pipeline over
Twitter messages is considered: (i) we identify which tweets can
be considered as arguments and which cannot, and (ii) we identify
what is the relation, i.e., support or attack, linking such tweets to
each other.
O13 - Large Projects and Infrastructures
Wednesday, May 25, 18:10
Chairperson: Walter Daelemans (Oral Session)
Port4NooJ v3.0: Integrated Linguistic Resources for Portuguese NLP
Cristina Mota, Paula Carvalho and Anabela Barreiro
This paper introduces Port4NooJ v3.0, the latest version of the
Portuguese module for NooJ, highlights its main features, and
details its three main new components: (i) a lexicon-grammar
based dictionary of 5,177 human intransitive adjectives, and a set
of local grammars that use the distributional properties of those
adjectives for paraphrasing, (ii) a polarity dictionary with 9,031
entries for sentiment analysis, and (iii) a set of priority dictionaries
and local grammars for named entity recognition. These new
components were derived and/or adapted from publicly available
resources. The Port4NooJ v3.0 resource is innovative in terms of
the specificity of the linguistic knowledge it incorporates. The
dictionary is bilingual Portuguese-English, and the semantico-
syntactic information assigned to each entry validates the
linguistic relation between the terms in both languages. These
characteristics, which cannot be found in any other public resource
for Portuguese, make it a valuable resource for translation and
paraphrasing. The paper presents the current statistics and
describes the different complementary and synergic components
and integration efforts.
Collecting Language Resources for the Latvian e-Government Machine Translation Platform
Roberts Rozis, Andrejs Vasiljevs and Raivis Skadinš
This paper describes the corpus collection activity for building large
machine translation systems for the Latvian e-Government platform.
We describe requirements for corpora, selection and assessment
of data sources, collection of the public corpora and creation
of new corpora from miscellaneous sources. Methodology,
tools and assessment methods are also presented along with the
results achieved, challenges faced and conclusions made. Several
approaches to address the data scarceness are discussed. We
summarize the volume of obtained corpora and provide quality
metrics of MT systems trained on this data. The resulting MT systems
for English-Latvian, Latvian-English and Latvian-Russian are
integrated into the Latvian e-service portal and are freely available
on the HUGO.LV website. This paper can serve as guidance for
similar activities initiated in other countries, particularly in the
context of European Language Resource Coordination action.
Nederlab: Towards a Single Portal and Research Environment for Diachronic Dutch Text Corpora
Hennie Brugman, Martin Reynaert, Nicoline van der Sijs, René van Stipriaan, Erik Tjong Kim Sang and Antal van den Bosch
The Nederlab project aims to bring together all digitized texts
relevant to the Dutch national heritage, the history of the Dutch
language and culture (circa 800 – present) in one user friendly
and tool enriched open access web interface. This paper describes
Nederlab halfway through the project period and discusses the
collections incorporated, back-office processes, system back-end
as well as the Nederlab Research Portal end-user web application.
O14 - Document Classification and Text Categorisation
Wednesday, May 25, 18:10
Chairperson: Robert Frederking (Oral Session)
A Semi-Supervised Approach for Gender Identification
Juan Soler and Leo Wanner
In most of the research studies on Author Profiling, large
quantities of correctly labeled data are used to train the models.
However, this does not reflect the reality in forensic scenarios:
in practical linguistic forensic investigations, the resources that
are available to profile the author of a text are usually scarce.
To reflect this fact, we implemented a Semi-Supervised
Learning variant of the k nearest neighbors algorithm that uses
small sets of labeled data and a larger amount of unlabeled
data to classify the authors of texts by gender (man vs woman).
We describe the enriched KNN algorithm and show that the
use of unlabeled instances improves the accuracy of our gender
identification model. We also present a feature set that facilitates
the use of a very small number of instances, reaching accuracies
higher than 70% with only 113 instances to train the model. It is
also shown that the algorithm performs well using publicly
available data.
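The self-training flavour of k nearest neighbours described in this abstract can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' implementation: the function names, the unanimity threshold, and the use of Euclidean distance are all assumptions.

```python
# Illustrative sketch of semi-supervised KNN via self-training.
# All names and the unanimity threshold are assumptions, not from the paper.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Return the majority label among the k nearest neighbours of x,
    together with its vote share."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[nearest].tolist())
    label, count = votes.most_common(1)[0]
    return label, count / k

def self_training_knn(X_lab, y_lab, X_unlab, k=3, threshold=1.0):
    """Iteratively absorb unlabeled points whose neighbourhood vote share
    reaches the threshold, enlarging the labeled training set."""
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    remaining = list(range(len(X_unlab)))
    changed = True
    while changed and remaining:
        changed = False
        for i in list(remaining):
            label, share = knn_predict(X_lab, y_lab, X_unlab[i], k)
            if share >= threshold:
                X_lab = np.vstack([X_lab, X_unlab[i]])
                y_lab = np.append(y_lab, label)
                remaining.remove(i)
                changed = True
    return X_lab, y_lab
```

In this sketch, an unlabeled instance is added to the training set only when its k nearest labeled neighbours agree on a label, which is one common way to let unlabeled data enlarge a small labeled set.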
Ensemble Classification of Grants using LDA-based Features
Yannis Korkontzelos, Beverley Thomas, Makoto Miwa and Sophia Ananiadou
Classifying research grants into useful categories is a vital
task for a funding body to give structure to the portfolio
for analysis, informing strategic planning and decision-making.
Automating this classification process would save time and effort,
provided that the accuracy of the classifications is maintained. We
employ five classification models to classify a set of BBSRC-funded
research grants into 21 research topics based on unigrams,
technical terms and Latent Dirichlet Allocation models. To
boost precision, we investigate methods for combining their
predictions into five aggregate classifiers. Evaluation confirmed
that ensemble classification models lead to higher precision. It
was observed that there is no single best-performing aggregate
method for all research topics. Instead, the best-performing
method for a research topic depends on the number of positive
training instances available for this topic. Subject matter
experts considered the predictions of aggregate models to correct
erroneous or incomplete manual assignments.
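One simple way to combine several classifiers' predictions into an aggregate classifier, as this abstract evaluates, is majority voting. The sketch below is illustrative only; the paper investigates five aggregate methods, and this stand-in is not claimed to be one of them.

```python
# Illustrative only: majority voting as one simple aggregation scheme
# over several classifiers' predictions for the same set of grants.
from collections import Counter

def majority_vote(predictions):
    """predictions: one label list per classifier, all over the same items.
    Returns the per-item majority label."""
    n_items = len(predictions[0])
    aggregated = []
    for i in range(n_items):
        votes = Counter(p[i] for p in predictions)
        aggregated.append(votes.most_common(1)[0][0])
    return aggregated
```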
Edit Categories and Editor Role Identification in Wikipedia
Diyi Yang, Aaron Halfaker, Robert Kraut and Eduard Hovy
In this work, we introduced a corpus for categorizing edit types in
Wikipedia. This fine-grained taxonomy of edit types enables us
to differentiate editing actions and find editor roles in Wikipedia
based on their low-level edit types. To do this, we first created an
annotated corpus based on 1,996 edits obtained from 953 article
revisions and built machine-learning models to automatically
identify the edit categories associated with edits. Building on
this automated measurement of edit types, we then applied a
graphical model analogous to Latent Dirichlet Allocation to
uncover the latent roles in editors’ edit histories. Applying this
technique revealed eight different roles editors play, such as Social
Networker, Substantive Expert, etc.
O15 - Morphology (1)
Wednesday, May 25, 18:10
Chairperson: Tamás Váradi (Oral Session)
Morphologically Annotated Corpora and Morphological Analyzers for Moroccan and Sanaani Yemeni Arabic
Faisal Al shargi, Aidan Kaplan, Ramy Eskander, Nizar Habash and Owen Rambow
We present new language resources for Moroccan and Sanaani
Yemeni Arabic. The resources include corpora for each dialect
which have been morphologically annotated, and morphological
analyzers for each dialect which are derived from these corpora.
These are the first sets of resources for Moroccan and Yemeni
Arabic. The resources will be made available to the public.
Merging Data Resources for Inflectional and Derivational Morphology in Czech
Zdenek Žabokrtský, Magda Sevcikova, Milan Straka, Jonáš Vidra and Adéla Limburská
The paper deals with merging two complementary resources of
morphological data previously existing for Czech, namely the
inflectional dictionary MorfFlex CZ and the recently developed
lexical network DeriNet. The MorfFlex CZ dictionary has been
used by a morphological analyzer capable of analyzing/generating
several million Czech word forms according to the rules of
Czech inflection. The DeriNet network contains several hundred
thousand Czech lemmas interconnected with links corresponding
to derivational relations (relations between base words and words
derived from them). After summarizing basic characteristics of
both resources, the process of merging is described, focusing on
both rather technical aspects (growth of the data, measuring the
quality of newly added derivational relations) and linguistic issues
(treating lexical homonymy and vowel/consonant alternations).
The resulting resource contains 970 thousand lemmas connected
with 715 thousand derivational relations and is publicly available
on the web under the CC-BY-NC-SA license. The data
were incorporated in the MorphoDiTa library version 2.0
(which provides morphological analysis, generation, tagging and
lemmatization for Czech) and can be browsed and searched by
two web tools (DeriNet Viewer and DeriNet Search tool).
A New Integrated Open-source Morphological Analyzer for Hungarian
Attila Novák, Borbála Siklósi and Csaba Oravecz
The goal of a Hungarian research project has been to create
an integrated Hungarian natural language processing framework.
This infrastructure includes tools for analyzing Hungarian texts,
integrated into a standardized environment. The morphological
analyzer is one of the core components of the framework. The goal
of this paper is to describe a fast and customizable morphological
analyzer and its development framework, which synthesizes and
further enriches the morphological knowledge implemented in
previous tools existing for Hungarian. In addition, we present
the method we applied to add semantic knowledge to the lexical
database of the morphology. The method utilizes neural word
embedding models and morphological and shallow syntactic
knowledge.
O16 - Phonetics and Prosody
Wednesday, May 25, 18:10
Chairperson: Dafydd Gibbon (Oral Session)
New release of Mixer-6: Improved validity for phonetic study of speaker variation and identification
Eleanor Chodroff, Matthew Maciejewski, Jan Trmal, Sanjeev Khudanpur and John Godfrey
The Mixer series of speech corpora were collected over several
years, principally to support annual NIST evaluations of speaker
recognition (SR) technologies. These evaluations focused on
conversational speech over a variety of channels and recording
conditions. One of the series, Mixer-6, added a new condition,
read speech, to support basic scientific research on speaker
characteristics, as well as technology evaluation. With read speech
it is possible to make relatively precise measurements of phonetic
events and features, which can be correlated with the performance
of speaker recognition algorithms, or directly used in phonetic
analysis of speaker variability. The read speech, as originally
recorded, was adequate for large-scale evaluations (e.g., fixed-text
speaker ID algorithms) but only marginally suitable for acoustic-
phonetic studies. Numerous errors due largely to speaker behavior
remained in the corpus, with no record of their locations or rate
of occurrence. We undertook the effort to correct this situation
with automatic methods supplemented by human listening and
annotation. The present paper describes the tools and methods,
resulting corrections, and some examples of the kinds of research
studies enabled by these enhancements.
Assessing the Prosody of Non-Native Speakers of English: Measures and Feature Sets
Eduardo Coutinho, Florian Hönig, Yue Zhang, Simone Hantke, Anton Batliner, Elmar Nöth and Björn Schuller
In this paper, we describe a new database with audio recordings of
non-native (L2) speakers of English, and the perceptual evaluation
experiment conducted with native English speakers for assessing
the prosody of each recording. These annotations are then used
to compute the gold standard using different methods, and a
series of regression experiments is conducted to evaluate their
impact on the performance of a regression model predicting
the degree of naturalness of L2 speech. Further, we compare
the relevance of different feature groups modelling prosody in
general (without speech tempo), speech rate and pauses modelling
speech tempo (fluency), voice quality, and a variety of spectral
features. We also discuss the impact of various fusion strategies
on performance. Overall, our results demonstrate that the prosody
of non-native speakers of English as L2 can be reliably assessed
using supra-segmental audio features; prosodic features seem to
be the most important ones.
The IFCASL Corpus of French and German Non-native and Native Read Speech
Juergen Trouvain, Anne Bonneau, Vincent Colotte, Camille Fauth, Dominique Fohr, Denis Jouvet, Jeanin Jügler, Yves Laprie, Odile Mella, Bernd Möbius and Frank Zimmerer
The IFCASL corpus is a French-German bilingual phonetic
learner corpus designed, recorded and annotated in a project on
individualized feedback in computer-assisted spoken language
learning. The motivation for setting up this corpus was that
there is no phonetically annotated and segmented corpus for this
language pair of comparable size and coverage. In contrast to
most learner corpora, the IFCASL corpus incorporates data for a
language pair in both directions, i.e. in our case French learners
of German, and German learners of French. In addition, the
corpus is complemented by two sub-corpora of native speech by
the same speakers. The corpus provides spoken data by about 100
speakers with comparable productions, annotated and segmented
on the word and the phone level, with more than 50% manually
corrected data. The paper reports on inter-annotator agreement
and the optimization of the acoustic models for forced speech-
text alignment in exercises for computer-assisted pronunciation
training. Example studies based on the corpus data with a phonetic
focus include topics such as the realization of /h/ and glottal stop,
final devoicing of obstruents, vowel quantity and quality, pitch
range, and tempo.
P14 - Lexical Databases
Wednesday, May 25, 18:10 - 19:10
Chairperson: Amália Mendes (Poster Session)
LELIO: An Auto-Adaptative System to Acquire Domain Lexical Knowledge in Technical Texts
Patrick Saint-Dizier
In this paper, we investigate some language acquisition facets of
an auto-adaptative system that can automatically acquire most
of the relevant lexical knowledge and authoring practices for
an application in a given domain. This is the LELIO project:
producing customized LELIE solutions. Our goal, within the
framework of LELIE (a system that tags language uses that do
not follow the Constrained Natural Language principles), is to
automate the long, costly and error prone lexical customization
of LELIE to a given application domain. Technical texts
being relatively restricted in terms of syntax and lexicon, results
obtained show that this approach is feasible and relatively reliable.
By auto-adaptative, we mean that the system learns from a sample
of the application corpus the various lexical terms and uses crucial
for LELIE to work properly (e.g. verb uses, fuzzy terms, business
terms, stylistic patterns). A technical writer validation method is
developed at each step of the acquisition.
Wikification for Scriptio Continua
Yugo Murawaki and Shinsuke Mori
The fact that Japanese employs scriptio continua, or a writing
system without spaces, complicates the first step of an NLP
pipeline. Word segmentation is widely used in Japanese
language processing, and lexical knowledge is crucial for reliable
identification of words in text. Although external lexical resources
like Wikipedia are potentially useful, segmentation mismatch
prevents them from being straightforwardly incorporated into the
word segmentation task. If we intentionally violate segmentation
standards with the direct incorporation, quantitative evaluation
will no longer be feasible. To address this problem, we propose
to define a separate task that directly links given texts to an
external resource, that is, wikification in the case of Wikipedia. By
doing so, we can circumvent segmentation mismatch that may not
necessarily be important for downstream applications. As the first
step to realize the idea, we design the task of Japanese wikification
and construct wikification corpora. We annotated subsets of
the Balanced Corpus of Contemporary Written Japanese plus
Twitter short messages. We also implement a simple wikifier and
investigate its performance on these corpora.
Accessing and Elaborating Walenty – a Valence Dictionary of Polish – via Internet Browser
Bartłomiej Niton, Tomasz Bartosiak and Elzbieta Hajnicz
This article presents Walenty - a new valence dictionary of Polish
predicates, concentrating on its creation process and access via
Internet browser. The dictionary contains two layers, syntactic
and semantic. The syntactic layer describes syntactic and
morphosyntactic constraints predicates put on their dependants.
The semantic layer shows how predicates and their arguments
are involved in a situation described in an utterance. These two
layers are connected, representing how semantic arguments can
be realised on the surface. Walenty also contains a powerful
phraseological (idiomatic) component. Walenty has been created
and can be accessed remotely with a dedicated tool called Slowal.
In this article, we focus on the most important functionalities of this
system. First, we depict how to access the dictionary and
how the built-in filtering system (covering both syntactic and semantic
phenomena) works. Later, we describe the process of creating the
dictionary with the Slowal tool, which both supports and controls the
work of lexicographers.
CEPLEXicon – A Lexicon of Child European Portuguese
Ana Lúcia Santos, Maria João Freitas and Aida Cardoso
CEPLEXicon (version 1.1) is a child lexicon resulting from
the automatic tagging of two child corpora: the corpus Santos
(Santos, 2006; Santos et al. 2014) and the corpus Child
– Adult Interaction (Freitas et al. 2012), which integrates
information from the corpus Freitas (Freitas, 1997). This
lexicon includes spontaneous speech produced by seven children
(1;02.00 to 3;11.12) during approximately 86h of child-adult
interaction. The automatic tagging comprised the lemmatization
and morphosyntactic classification of the speech produced by
the seven children included in the two child corpora; the
lexicon contains information pertaining to lemmas and syntactic
categories as well as absolute number of occurrences and
frequencies in three age intervals: < 2 years; ≥ 2 years and < 3 years;
≥ 3 years. The information included in this lexicon and the format in
which it is presented enables research in different areas and allows
researchers to obtain measures of lexical growth. CEPLEXicon is
available through the ELRA catalogue.
Extracting Weighted Language Lexicons from Wikipedia
Gregory Grefenstette
Language models are used in applications as diverse as
speech recognition, optical character recognition and information
retrieval. They are used to predict word appearance, and to
weight the importance of words in these applications. One basic
element of language models is the list of words in a language.
Another is the unigram frequency of each word. But this basic
information is not available for most languages in the world. Since
the multilingual Wikipedia project encourages the production of
encyclopedic-like articles in many world languages, we can find
there an ever-growing source of text from which to extract these
two language modelling elements: word list and frequency. Here
we present a simple technique for converting this Wikipedia
text into lexicons of weighted unigrams for the more than 280
languages currently present in Wikipedia. The lexicons
produced, and the source code for producing them on a Linux-
based system, are made freely available on the Web.
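The two language-modelling elements this abstract names, a word list and unigram frequencies, can be illustrated with a minimal sketch. The tokenizer and the toy input below are placeholders, not the authors' extraction pipeline.

```python
# Minimal sketch of building a weighted unigram lexicon from raw text.
# The regex tokenizer and toy input are placeholders for illustration.
import re
from collections import Counter

def weighted_lexicon(text):
    """Lowercase the text, tokenize on runs of letters, and return each
    word's relative frequency."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # Unicode letter runs
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

lex = weighted_lexicon("the cat sat on the mat")
```

Applied to Wikipedia article text per language, the keys give the word list and the values the unigram weights.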
Wiktionnaire’s Wikicode GLAWIfied: a Workable French Machine-Readable Dictionary
Nabil Hathout and Franck Sajous
GLAWI is a free, large-scale and versatile Machine-Readable
Dictionary (MRD) that has been extracted from the French
language edition of Wiktionary, called Wiktionnaire. In
(Sajous and Hathout, 2015), we introduced GLAWI, gave the
rationale behind the creation of this lexicographic resource and
described the extraction process, focusing on the conversion
and standardization of the heterogeneous data provided by this
collaborative dictionary. In the current article, we describe the
content of GLAWI and illustrate how it is structured. We also
suggest various applications, ranging from linguistic studies, NLP
applications to psycholinguistic experimentation. They all can
take advantage of the diversity of the lexical knowledge available
in GLAWI. Besides this diversity and extensive lexical coverage,
GLAWI is also remarkable because it is the only free lexical
resource of contemporary French that contains definitions. This
unique material opens the way to the renewal of MRD-based methods,
notably the automated extraction and acquisition of semantic
relations.
P15 - Multimodality
Wednesday, May 25, 18:10 - 19:10
Chairperson: Carlo Strapparava (Poster Session)
A Corpus of Images and Text in Online News
Laura Hollink, Adriatik Bedjeti, Martin van Harmelen andDesmond Elliott
In recent years, several datasets have been released that include
images and text, giving impulse to new methods that combine
natural language processing and computer vision. However, there
is a need for datasets of images in their natural textual context.
The ION corpus contains 300K news articles published between
August 2014 and August 2015 in five online newspapers from two countries.
The 1-year coverage over multiple publishers ensures a broad
scope in terms of topics, image quality and editorial viewpoints.
The corpus consists of JSON-LD files with the following data
about each article: the original URL of the article on the news
publisher’s website, the date of publication, the headline of the
article, the URL of the image displayed with the article (if
any), and the caption of that image. Neither the article text
nor the images themselves are included in the corpus. Instead,
the images are distributed as high-dimensional feature vectors
extracted from a Convolutional Neural Network, anticipating their
use in computer vision tasks. The article text is represented as a
list of automatically generated entity and topic annotations in the
form of Wikipedia/DBpedia pages. This facilitates the selection
of subsets of the corpus for separate analysis or evaluation.
BosphorusSign: A Turkish Sign Language Recognition Corpus in Health and Finance Domains
Necati Cihan Camgöz, Ahmet Alp Kındıroglu, Serpil Karabüklü, Meltem Kelepir, Ayse Sumru Özsoy and Lale Akarun
There are as many sign languages as there are deaf communities
in the world. Linguists have been collecting corpora of different
sign languages and annotating them extensively in order to study
and understand their properties. On the other hand, the field of
computer vision has approached the sign language recognition
problem as a grand challenge and research efforts have intensified
in the last 20 years. However, corpora collected for studying
linguistic properties are often not suitable for sign language
recognition as the statistical methods used in the field require large
amounts of data. Recently, with the availability of inexpensive
depth cameras, groups from the computer vision community have
started collecting corpora with a large number of repetitions for
sign language recognition research. In this paper, we present the
BosphorusSign Turkish Sign Language corpus, which consists of
855 sign and phrase samples from the health, finance and everyday
life domains. The corpus is collected using the state-of-the-art
Microsoft Kinect v2 depth sensor, and will be the first of its kind in this sign
language research field. Furthermore, there will be annotations
rendered by linguists so that the corpus will appeal both to the
linguistic and sign language recognition research communities.
The CIRDO Corpus: Comprehensive Audio/Video Database of Domestic Falls of Elderly People
Michel Vacher, Saïda Bouakaz, Marc-Eric Bobillier Chaumon, Frédéric Aman, R. A. Khan, Slima Bekkadja, François Portet, Erwan Guillou, Solange Rossato and Benjamin Lecouteux
Ambient Assisted Living aims at enhancing the quality of life of
older and disabled people at home thanks to Smart Homes. In
particular, regarding elderly living alone at home, the detection of
distress situation after a fall is very important to reassure this kind
of population. However, many studies do not include tests in real
settings, because data collection in this domain is very expensive
and challenging and because of the few available data sets. The
CIRDO corpus is a dataset recorded in realistic conditions in
DOMUS, a fully equipped Smart Home with microphones
and home automation sensors, in which participants performed
scenarios including real falls on a carpet and calls for help. These
scenarios were elaborated thanks to a field study involving elderly
persons. Experiments are presented on real-time distress detection
using audio and speech analysis, and on fall detection using video
analysis. Results show the
difficulty of the task. The database can be used as a standardized
database by researchers to evaluate and compare their systems for
elderly person’s assistance.
Semi-automatic Alignment of Predicates between Speech and OntoNotes Data
Niraj Shrestha and Marie-Francine Moens
Speech data currently receives growing attention and is an
important source of information. However, unlike for written data,
we still lack suitable corpora of transcribed speech annotated with
semantic roles that can be used for semantic role labeling (SRL).
Semantic role labeling in speech data is a challenging and
complex task due to the lack of sentence boundaries and the many
transcription errors such as insertion, deletion and misspellings
of words. In written data, SRL evaluation is performed at
the sentence level, but in speech data sentence boundary
identification is still a bottleneck, which makes evaluation more
complex. In this work, we semi-automatically align the predicates
found in transcribed speech obtained with an automatic speech
recognizer (ASR) with the predicates found in the corresponding
written documents of the OntoNotes corpus and manually align
the semantic roles of these predicates thus obtaining annotated
semantic frames in the speech data. This data can serve as gold
standard alignments for future research in semantic role labeling
of speech data.
CORILSE: a Spanish Sign Language Repository for Linguistic Analysis
María del Carmen Cabeza-Pereiro, José M. García-Miguel, Carmen García Mateo and José Luis Alba Castro
CORILSE is a computerized corpus of Spanish Sign Language
(Lengua de Signos Española, LSE). It consists of a set of
recordings from different discourse genres by Galician signers
living in the city of Vigo. In this paper we describe its annotation
system, developed on the basis of pre-existing ones (mostly the
model of Auslan corpus). This includes primary annotation of id-
glosses for manual signs, annotation of non-manual component,
and secondary annotation of grammatical categories and relations,
because this corpus is being built for grammatical analysis, in
particular argument structures in LSE. Up until this moment the
annotation has been basically made by hand, which is a slow and
time-consuming task. The need to facilitate this process leads us
to engage in the development of automatic or semi-automatic tools
for manual and facial recognition. Finally, we also present the web
repository that will make the corpus available to different types of
users, and will allow its exploitation for research purposes and
other applications (e.g. teaching of LSE or design of tasks for
signed language assessment).
The OFAI Multi-Modal Task Description Corpus
Stephanie Schreitter and Brigitte Krenn
The OFAI Multimodal Task Description Corpus (OFAI-MMTD
Corpus) is a collection of dyadic teacher-learner (human-human
and human-robot) interactions. The corpus is multimodal
and tracks the communication signals exchanged between
interlocutors in task-oriented scenarios including speech, gaze and
gestures. The focus of interest lies on the communicative signals
conveyed by the teacher and which objects are salient at which
time. Data are collected from four different task description setups
which involve spatial utterances, navigation instructions and more
complex descriptions of joint tasks.
A Japanese Chess Commentary Corpus
Shinsuke Mori, John Richardson, Atsushi Ushiku, TetsuroSasada, Hirotaka Kameko and Yoshimasa Tsuruoka
In recent years there has been a surge of interest in natural
language processing related to the real world, such as symbol
grounding, language generation, and nonlinguistic data search by
natural language queries. In order to concentrate on language
ambiguities, we propose to use a well-defined “real world”, that
is, game states. We built a corpus consisting of pairs of sentences
and a game state. The game we focus on is shogi (Japanese
chess). We collected 742,286 commentary sentences in Japanese.
They are spontaneously generated, in contrast to the natural
language annotations in many image datasets, which are provided by
human workers on Amazon Mechanical Turk. We defined domain-specific named
entities and we segmented 2,508 sentences into words manually
and annotated each word with a named entity tag. We describe a
detailed definition of named entities and show some statistics of
our game commentary corpus. We also show the results of the
experiments of word segmentation and named entity recognition.
The accuracies are as high as those on general-domain texts,
indicating that we are ready to tackle various new problems related
to the real world.
The CAMOMILE Collaborative Annotation Platform for Multi-modal, Multi-lingual and Multi-media Documents
Johann Poignant, Mateusz Budnik, Hervé Bredin, Claude Barras, Mickael Stefas, Pierrick Bruneau, Gilles Adda, Laurent Besacier, Hazim Ekenel, Gil Francopoulo, Javier Hernando, Joseph Mariani, Ramon Morros, Georges Quénot, Sophie Rosset and Thomas Tamisier
In this paper, we describe the organization and the implementation
of the CAMOMILE collaborative annotation framework for
multimodal, multimedia, multilingual (3M) data. Given the
versatile nature of the analysis which can be performed on 3M
data, the structure of the server was kept intentionally simple
in order to preserve its genericity, relying on standard Web
technologies. Layers of annotations, defined as data associated
to a media fragment from the corpus, are stored in a database and
can be managed through standard interfaces with authentication.
Interfaces tailored specifically to the needed task can then be
developed in an agile way, relying on simple but reliable services
for the management of the centralized annotations. We then
present our implementation of an active learning scenario for
person annotation in video, relying on the CAMOMILE server;
during a dry run experiment, the manual annotation of 716 speech
segments was thus propagated to 3504 labeled tracks. The code of
the CAMOMILE framework is distributed in open source.
Finding Recurrent Features of Image Schema Gestures: the FIGURE corpus
Andy Luecking, Alexander Mehler, Désirée Walther, Marcel Mauri and Dennis Kurfürst
The Frankfurt Image GestURE corpus (FIGURE) is introduced.
The corpus data was collected in an experimental setting in which 50
naive participants spontaneously produced gestures in response
to five to six terms from a total of 27 stimulus terms. The
stimulus terms have been compiled mainly from image schemata
from psycholinguistics, since such schemata provide a panoply
of abstract contents derived from natural language use. The
gestures have been annotated for kinetic features. FIGURE
aims at finding (sets of) stable kinetic feature configurations
associated with the stimulus terms. Given such configurations,
they can be used for designing HCI gestures that go beyond pre-
defined gesture vocabularies or touchpad gestures. It is found,
for instance, that movement trajectories are far more informative
than handshapes, speaking against purely handshape-based HCI
vocabularies. Furthermore, the mean temporal duration of the
associated hand and arm movements varies with the stimulus terms,
indicating a dynamic dimension not covered by vocabulary-based
approaches. Descriptive results are presented and related to
findings from gesture studies and natural language dialogue.
An Interaction-Centric Dataset for Learning Automation Rules in Smart Homes
Kai Frederic Engelmann, Patrick Holthaus, Britta Wrede and Sebastian Wrede
The term smart home refers to a living environment that, through
its connected sensors and actuators, is capable of providing
intelligent and contextualised support to its user. This may
result in automated behaviors that blend into the user’s daily
life. However, most current smart homes do not provide
such intelligent support. A first step towards such intelligent
capabilities lies in learning automation rules by observing the
user’s behavior. We present a new type of corpus for learning
such rules from user behavior as observed from the events in
a smart home’s sensor and actuator network. The data contains
information about intended tasks by the users and synchronized
events from this network. It is derived from interactions of 59
users with the smart home in order to solve five tasks. The corpus
contains recordings of more than 40 different types of data streams
and has been segmented and pre-processed to increase signal
quality. Overall, the data shows a high noise level on specific data
types that can be filtered out by a simple smoothing approach. The
resulting data provides insights into event patterns resulting from
task specific user behavior and thus constitutes a basis for machine
learning approaches to learn automation rules.
A Web Tool for Building Parallel Corpora of Spoken and Sign Languages
Alex Becker, Fabio Kepler and Sara Candeias
In this paper we describe our work in building an online tool
for manually annotating texts in any spoken language with
SignWriting in any sign language. The existence of such a tool will
allow the creation of parallel corpora between spoken and sign
languages that can be used to bootstrap the creation of efficient
tools for the Deaf community. As an example, a parallel corpus
between English and American Sign Language could be used
for training Machine Learning models for automatic translation
between the two languages. Clearly, this kind of tool must be
designed in a way that it eases the task of human annotators, not
only by being easy to use, but also by giving smart suggestions
as the annotation progresses, in order to save time and effort.
By building a collaborative, online, easy to use annotation tool
for building parallel corpora between spoken and sign languages
we aim at helping the development of proper resources for
sign languages that can then be used in state-of-the-art models
currently used in tools for spoken languages. There are several
issues and difficulties in creating this kind of resource, and our
presented tool already deals with some of them, such as adequate text
representation of a sign and many-to-many alignments between
words and signs.
P16 - Ontologies
Wednesday, May 25, 18:10 - 19:10
Chairperson: Elena Montiel Ponsoda
Poster Session
Issues and Challenges in Annotating Urdu Action Verbs on the IMAGACT4ALL Platform
Sharmin Muzaffar, Pitambar Behera and Girish Jha
In South Asian languages such as Hindi and Urdu, action verbs
involving compound constructions and serial verb constructions
pose serious problems for natural language processing and
other linguistic tasks. Urdu is an Indo-Aryan language
spoken by 51,500,000 speakers in India. Action verbs
that occur spontaneously in day-to-day communication are
highly ambiguous in nature semantically and as a consequence
cause disambiguation issues that are relevant and applicable
to Language Technologies (LT) like Machine Translation (MT)
and Natural Language Processing (NLP). IMAGACT4ALL is an
ontology-driven web-based platform developed by the University
of Florence for storing action verbs and their inter-relations.
This group is currently collaborating with Jawaharlal Nehru
University (JNU) in India to connect Indian languages on this
platform. Action verbs are frequently used in both written and
spoken discourses and refer to various meanings because of their
polysemic nature. The IMAGACT4ALL platform stores 3D
animation images, each referring to a variety of possible
ontological types, which in turn makes the annotation task quite
challenging for the annotator with regard to selecting a verb
argument structure among a range of probable alternatives.
The authors, in this paper, discuss the issues and challenges
such as complex predicates (compound and conjunct verbs),
ambiguously animated video illustrations, semantic discrepancies,
and the factors of verb-selection preferences that have produced
significant problems in annotating Urdu verbs on the IMAGACT
ontology.
Domain Ontology Learning Enhanced by Optimized Relation Instance in DBpedia
Liumingjing Xiao, Chong Ruan, An Yang, Junhao Zhang and Junfeng Hu
Ontologies are powerful for supporting semantics-based applications
and intelligent systems, but ontology learning is challenging
due to the bottleneck of handcrafting structured knowledge sources
and training data. To address this difficulty, many researchers turn
to ontology enrichment and population using external knowledge
sources such as DBpedia. In this paper, we propose a method
using DBpedia in a different manner. We utilize relation instances
in DBpedia to supervise the ontology learning procedure from
unstructured text, rather than populate the ontology structure as
a post-processing step. We construct three language resources
in areas of computer science: enriched Wikipedia concept tree,
domain ontology, and gold standard from NSFC taxonomy.
Experiments show that the result of ontology learning from a corpus
of computer science texts can be improved via relation instances
extracted from DBpedia in the same field. Furthermore,
distinguishing between the relation instances and applying a proper
weighting scheme in the learning procedure leads to an even better
result.
Constructing a Norwegian Academic Wordlist
Janne M Johannessen, Arash Saidi and Kristin Hagen
We present the development of a Norwegian Academic Wordlist
(AKA list) for the Norwegian Bokmål variety. To identify
specific academic vocabulary we developed a 100-million-word
academic corpus based on the University of Oslo archive of digital
publications. Other corpora were used for testing and developing
general word lists. We tried two different methods, those of
Carlund et al. (2012) and Gardner & Davies (2013), and compared
them. The resulting list is presented on a web site, where the
words can be inspected in different ways, and freely downloaded.
The Event and Implied Situation Ontology (ESO): Application and Evaluation
Roxane Segers, Marco Rospocher, Piek Vossen, Egoitz Laparra, German Rigau and Anne-Lyse Minard
This paper presents the Event and Implied Situation Ontology
(ESO), a manually constructed resource which formalizes the
pre and post situations of events and the roles of the entities
affected by an event. The ontology is built on top of existing
resources such as WordNet, SUMO and FrameNet. The ontology
is injected into the Predicate Matrix, a resource that integrates
predicate and role information from, amongst others, FrameNet,
VerbNet, PropBank, NomBank and WordNet. We illustrate how
these resources are used on large document collections to detect
information that otherwise would have remained implicit. The
ontology is evaluated on two aspects: recall and precision based
on a manually annotated corpus and secondly, on the quality of
the knowledge inferred by the situation assertions in the ontology.
Evaluation results on the quality of the system show that 50% of
the events typed and enriched with ESO assertions are correct.
Combining Ontologies and Neural Networks for Analyzing Historical Language Varieties. A Case Study in Middle Low German
Maria Sukhareva and Christian Chiarcos
In this paper, we describe experiments on the morphosyntactic
annotation of historical language varieties for the example of
Middle Low German (MLG), the official language of the German
Hanse during the Middle Ages and a dominant language around
the Baltic Sea at the time. To the best of our knowledge, this is
the first experiment in automatically producing morphosyntactic
annotations for Middle Low German, and accordingly, no part-of-
speech (POS) tagset is currently agreed upon. In our experiment,
we illustrate how ontology-based specifications of projected
annotations can be employed to circumvent this issue: Instead
of training and evaluating against a given tagset, we decompose
it into independent features which are predicted independently
by a neural network. Using consistency constraints (axioms)
from an ontology, then, the predicted feature probabilities are
decoded into a sound ontological representation. Using these
representations, we can finally bootstrap a POS tagset capturing
only morphosyntactic features which could be reliably predicted.
In this way, our approach is capable of optimizing precision
and recall of morphosyntactic annotations simultaneously with
bootstrapping a tagset rather than performing iterative cycles.
Ecological Gestures for HRI: the GEE Corpus
Maxence Girard-Rivier, Romain Magnani, Veronique Auberge, Yuko Sasa, Liliya Tsvetanova, Frederic Aman and Clarisse Bayol
As part of a human-robot interaction project, we are interested
in the gestural modality as one of many ways to communicate,
with the aim of developing a relevant gesture recognition system
for a smart home butler robot. Our methodology is based on an
IQ game-like Wizard of Oz experiment to collect spontaneous and
IQ game-like Wizard of Oz experiment to collect spontaneous and
implicitly produced gestures in an ecological context. During the
experiment, the subject has to use non-verbal cues (i.e. gestures)
to interact with a robot that is the referee. The subject is unaware
that their gestures are the focus of our study. In the second
part of the experiment, we asked the subjects to reproduce the
gestures they had made in the first part; these are the explicit gestures.
The implicit gestures are compared with explicitly produced ones
to determine a relevant ontology. This preliminary qualitative
analysis will be the base to build a big data corpus in order to
optimize acceptance of the gesture dictionary in coherence with
the “socio-affective glue” dynamics.
A Taxonomy of Spanish Nouns, a Statistical Algorithm to Generate it and its Implementation in Open Source Code
Rogelio Nazar and Irene Renau
In this paper we describe our work in progress in the automatic
development of a taxonomy of Spanish nouns, we offer the Perl
implementation we have so far, and we discuss the different
problems that still need to be addressed. We designed a
statistically-based taxonomy induction algorithm consisting of a
combination of different strategies not involving explicit linguistic
knowledge. Though all quantitative, the strategies we present
are of a different nature. Some of them are based on
the computation of distributional similarity coefficients which
identify pairs of sibling words or co-hyponyms, while others are
based on asymmetric co-occurrence and identify pairs of parent-
child words or hypernym-hyponym relations. A decision making
process is then applied to combine the results of the previous steps,
and finally connect lexical units to a basic structure containing the
most general categories of the language. We evaluate the quality
of the taxonomy both manually and also using Spanish Wordnet as
a gold-standard. We estimate an average of 89.07% precision and
25.49% recall considering only the results which the algorithm
presents with a high degree of certainty, or 77.86% precision and
33.72% recall considering all results.
P17 - Part of Speech Tagging (1)
Wednesday, May 25, 18:10 - 19:10
Chairperson: Krister Linden
Poster Session
FOLK-Gold – A Gold Standard for Part-of-Speech-Tagging of Spoken German
Swantje Westpfahl and Thomas Schmidt
In this paper, we present a GOLD standard of part-of-
speech tagged transcripts of spoken German. The GOLD
standard data consists of four annotation layers – transcription
(modified orthography), normalization (standard orthography),
lemmatization and POS tags – all of which have undergone careful
manual quality control. It comes with guidelines for the manual
POS annotation of transcripts of German spoken data and an
extended version of the STTS (Stuttgart Tübingen Tagset) which
accounts for phenomena typically found in spontaneous spoken
German. The GOLD standard was developed on the basis of the
Research and Teaching Corpus of Spoken German, FOLK, and is,
to our knowledge, the first such dataset based on a wide variety of
spontaneous and authentic interaction types. It can be used as a
basis for further development of language technology and corpus
linguistic applications for German spoken language.
Fast and Robust POS tagger for Arabic Tweets Using Agreement-based Bootstrapping
Fahad Albogamy and Allan Ramsay
Part-of-Speech (POS) tagging is a key step in many NLP
algorithms. However, tweets are difficult to POS tag because
they are short, do not always maintain formal grammar and
proper spelling, and often use abbreviations to overcome
their restricted lengths. Arabic tweets also show a further range
of linguistic phenomena such as usage of different dialects,
romanised Arabic and borrowing foreign words. In this paper,
we present an evaluation and a detailed error analysis of state-
of-the-art POS taggers for Arabic when applied to Arabic tweets.
On the basis of this analysis, we combine normalisation and
external knowledge to handle the domain noisiness and exploit
bootstrapping to construct extra training data in order to improve
POS tagging for Arabic tweets. Our results show significant
improvements over the performance of a number of well-known
taggers for Arabic.
Lemmatization and Morphological Tagging in German and Latin: A Comparison and a Survey of the State-of-the-art
Steffen Eger, Rüdiger Gleim and Alexander Mehler
This paper relates to the challenge of morphological tagging
and lemmatization in morphologically rich languages by example
of German and Latin. We focus on the question of what a
practitioner can expect when using state-of-the-art solutions out
of the box. Moreover, we contrast these with old(er) methods and
implementations for POS tagging. We examine to what degree
recent efforts in tagger development are reflected by improved
accuracies – and at what cost, in terms of training and processing
time. We also conduct in-domain vs. out-domain evaluation.
Out-domain evaluations are particularly insightful because the
distribution of the data which is being tagged by a user will
typically differ from the distribution on which the tagger has
been trained. Furthermore, two lemmatization techniques are
evaluated. Finally, we compare pipeline tagging vs. a tagging
approach that acknowledges dependencies between inflectional
categories.
TLT-CRF: A Lexicon-supported Morphological Tagger for Latin Based on Conditional Random Fields
Tim vor der Brück and Alexander Mehler
We present a morphological tagger for Latin, called TTLab Latin
Tagger based on Conditional Random Fields (TLT-CRF) which
uses a large Latin lexicon. Beyond Part of Speech (PoS), TLT-
CRF tags eight inflectional categories of verbs, adjectives or
nouns. It utilizes a statistical model based on CRFs together with
a rule interpreter that addresses scenarios of sparse training data.
We present results of evaluating TLT-CRF to answer the question
of what can be learnt following the paradigm of 1st order CRFs in
conjunction with a large lexical resource and a rule interpreter.
Furthermore, we investigate the contingency of representational
features and targeted parts of speech to learn about selective
features.
Cross-lingual and Supervised Models for Morphosyntactic Annotation: a Comparison on Romanian
Lauriane Aufrant, Guillaume Wisniewski and François Yvon
Because of the small size of Romanian corpora, the performance
of a PoS tagger or a dependency parser trained with standard
supervised methods falls far short of the performance achieved
in most languages. That is why we apply state-of-the-art methods
for cross-lingual transfer to Romanian tagging and parsing,
from English and several Romance languages. We compare
the performance with monolingual systems trained with sets of
different sizes, and establish that training on a few sentences in the
target language yields better results than transferring from large
datasets in other languages.
Corpus vs. Lexicon Supervision in Morphosyntactic Tagging: the Case of Slovene
Nikola Ljubešic and Tomaž Erjavec
In this paper we present a tagger developed for inflectionally rich
languages for which both a training corpus and a lexicon are
available. We do not constrain the tagger by the lexicon entries,
allowing both for lexicon incompleteness and noisiness. By using
the lexicon indirectly through features we allow for known and
unknown words to be tagged in the same manner. We test our
tagger on Slovene data, obtaining a 25% error reduction of the
best previous results both on known and unknown words. Given
that Slovene is, in comparison to some other Slavic languages, a
well-resourced language, we perform experiments on the impact
of token (corpus) vs. type (lexicon) supervision, obtaining useful
insights into how to balance the effort of extending resources to
yield better tagging results.
P18 - Treebanks (1)
Wednesday, May 25, 18:10 - 19:10
Chairperson: Béatrice Daille
Poster Session
Challenges and Solutions for Consistent Annotation of Vietnamese Treebank
Quy Nguyen, Yusuke Miyao, Ha Le and Ngan Nguyen
Treebanks are important resources for researchers in natural
language processing, speech recognition, theoretical linguistics,
etc. To strengthen the automatic processing of the Vietnamese
language, a Vietnamese treebank has been built. However, the
quality of this treebank is not satisfactory and is a possible source
for the low performance of Vietnamese language processing. We
have been building a new treebank for Vietnamese with about
40,000 sentences annotated with three layers: word segmentation,
part-of-speech tagging, and bracketing. In this paper, we describe
several challenges of the Vietnamese language and how we solve them
in developing annotation guidelines. We also present our methods
to improve the quality of the annotation guidelines and ensure
annotation accuracy and consistency. Experimental results show
that inter-annotator agreement ratios and accuracy are higher than
90%, which is satisfactory.
Correcting Errors in a Treebank Based on Tree Mining
Kanta Suzuki, Yoshihide Kato and Shigeki Matsubara
This paper provides a new method to correct annotation errors
in a treebank. The previous error correction method constructs
a pseudo parallel corpus where incorrect partial parse trees are
paired with correct ones, and extracts error correction rules from
the parallel corpus. By applying these rules to a treebank, the
method corrects errors. However, this method does not achieve
wide coverage of error correction. To achieve wide coverage,
our method adopts a different approach. In our method, we
consider that an infrequent pattern which can be transformed to
a frequent one is an annotation error pattern. Based on a tree
mining technique, our method seeks such infrequent tree patterns,
and constructs error correction rules each of which consists of
an infrequent pattern and a corresponding frequent pattern. We
conducted an experiment using the Penn Treebank. We obtained
1,987 rules which are not constructed by the previous method, and
the rules achieved good precision.
4Couv: A New Treebank for French
Philippe Blache, Gregoire de Montcheuil, Laurent Prévot and Stéphane Rauzy
The question of the type of text used as primary data in treebanks
is of considerable importance. First, it has an influence at the discourse
level: an article is not organized in the same way as a novel or a
technical document. Moreover, it also has consequences in terms
of semantic interpretation: some types of texts can be easier to
interpret than others. We present in this paper a new type of
treebank which has the particularity of answering the specific
needs of experimental linguistics. It is made of short texts (book
back covers) that present a strong coherence in their organization
and can be rapidly interpreted. This type of text is adapted to short
reading sessions, making it easy to acquire physiological data (e.g.
eye movement, electroencephalography). Such a resource offers
reliable data when looking for correlations between computational
models and human language processing.
CINTIL DependencyBank PREMIUM - A Corpus of Grammatical Dependencies for Portuguese
Rita de Carvalho, Andreia Querido, Marisa Campos, Rita Valadas Pereira, João Silva and António Branco
This paper presents a new linguistic resource for the study
and computational processing of Portuguese. CINTIL
DependencyBank PREMIUM is a corpus of Portuguese news text,
accurately manually annotated with a wide range of linguistic
information (morpho-syntax, named-entities, syntactic function
and semantic roles), making it an invaluable resource especially for
the development and evaluation of data-driven natural language
processing tools. The corpus is under active development,
reaching 4,000 sentences in its current version. The paper also
reports on the training and evaluation of a dependency parser
over this corpus. CINTIL DependencyBank PREMIUM is freely
available for research purposes through META-SHARE.
Estonian Dependency Treebank: from Constraint Grammar tagset to Universal Dependencies
Kadri Muischnek, Kaili Müürisep and Tiina Puolakainen
This paper presents the first version of Estonian Universal
Dependencies Treebank which has been semi-automatically
acquired from Estonian Dependency Treebank and comprises ca
400,000 words (ca 30,000 sentences) representing the genres of
fiction, newspapers and scientific writing. The article analyses the
differences between the two annotation schemes and the conversion
procedure to the Universal Dependencies format. The conversion
has been conducted by manually created Constraint Grammar
transfer rules. As the rules make it possible to consider unbounded
context and to include lexical information as well as both flat and
tree structure features at the same time, the method has proved
reliable and flexible enough to handle most of the transformations. The automatic
conversion procedure achieved LAS 95.2%, UAS 96.3% and LA
98.4%. If punctuation marks were excluded from the calculations,
we observed LAS 96.4%, UAS 97.7% and LA 98.2%. Still,
refinement of the guidelines and methodology is needed
in order to re-annotate some syntactic phenomena, e.g. inter-clausal
relations. Although automatic rules usually make quite
a good guess even in obscure conditions, some relations should be
checked and annotated manually after the main conversion.
The Universal Dependencies Treebank of Spoken Slovenian
Kaja Dobrovoljc and Joakim Nivre
This paper presents the construction of an open-source
dependency treebank of spoken Slovenian, the first syntactically
annotated collection of spontaneous speech in Slovenian. The
treebank has been manually annotated using the Universal
Dependencies annotation scheme, a one-layer syntactic annotation
scheme with a high degree of cross-modality, cross-framework
and cross-language interoperability. In this original application
of the scheme to spoken language transcripts, we address a
wide spectrum of syntactic particularities in speech, either by
extending the scope of application of existing universal labels or
by proposing new speech-specific extensions. The initial analysis
of the resulting treebank and its comparison with the written
Slovenian UD treebank confirms significant syntactic differences
between the two language modalities, with spoken data consisting
of shorter and more elliptic sentences, fewer and simpler nominal
phrases, and more relations marking disfluencies, interaction,
deixis and modality.
Introducing the Asian Language Treebank (ALT)
Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finchand Eiichiro Sumita
This paper introduces the ALT project initiated by the Advanced
Speech Translation Research and Development Promotion Center
(ASTREC), NICT, Kyoto, Japan. The aim of this project is to
accelerate NLP research for Asian languages such as Indonesian,
Japanese, Khmer, Laos, Malay, Myanmar, Philippine, Thai and
Vietnamese. The original resource for this project was English
articles that were randomly selected from Wikinews. The project
has so far created a corpus for Myanmar and will extend in scope
to include other languages in the near future. A 20,000-sentence
corpus of Myanmar that has been manually translated from an
English corpus has been word segmented, word aligned, part-of-
speech tagged and constituency parsed by human annotators. In
this paper, we present the implementation steps for creating the
treebank in detail, including a description of the ALT web-based
treebanking tool. Moreover, we report statistics on the annotation
quality of the Myanmar treebank created so far.
Universal Dependencies for Norwegian
Lilja Øvrelid and Petter Hohle
This article describes the conversion of the Norwegian
Dependency Treebank (NDT) to the Universal Dependencies scheme.
It details the mapping of PoS tags, morphological
features and dependency relations, and provides a description of
the structural changes made to NDT analyses in order to make
them compliant with the UD guidelines. We further present PoS
tagging and dependency parsing experiments which report first
results for the processing of the converted treebank. The full
converted treebank was made available with the 1.2 release of the
UD treebanks.
O17 - Language Resource Policies
Thursday, May 26, 9:45
Chairperson: Edouard Geoffrois
Oral Session
Fostering the Next Generation of European Language Technology: Recent Developments – Emerging Initiatives – Challenges and Opportunities
Georg Rehm, Jan Hajic, Josef van Genabith and Andrejs Vasiljevs
META-NET is a European network of excellence, founded
in 2010, that consists of 60 research centres in 34 European
countries. One of the key visions and goals of META-NET is a
truly multilingual Europe, which is substantially supported and
realised through language technologies. In this article we provide
an overview of recent developments around the multilingual
Europe topic; we also describe recent and upcoming events as well
as recent and upcoming strategy papers. Furthermore, we provide
overviews of two new emerging initiatives, the CEF.AT and ELRC
activity on the one hand and the Cracking the Language Barrier
federation on the other. The paper closes with several suggested
next steps in order to address the current challenges and to open
up new opportunities.
Yes, We Care! Results of the Ethics and Natural Language Processing Surveys
Karën Fort and Alain Couillault
We present here the context and results of two surveys (a French
one and an international one) concerning Ethics and NLP, which
we designed and conducted between June and September 2015.
These surveys follow other actions related to raising concern
for ethics in our community, including a Journée d’études, a
workshop and the Ethics and Big Data Charter. The concern
for ethics proves to be quite similar in both surveys, despite a
few differences, which we present and discuss. The surveys also
suggest that there is a growing awareness in the field concerning
ethical issues, which translates into a willingness to get involved
in ethics-related actions, to debate about the topic and to see ethics
be included in major conferences themes. We finally discuss the
limits of the surveys and the means of action we consider for the
future. The raw data from the two surveys are freely available
online.
Open Data Vocabularies for Assigning Usage Rights to Data Resources from Translation Projects
David Lewis, Kaniz Fatema, Alfredo Maldonado, Brian Walshe and Arturo Calvo
An assessment of the intellectual property requirements for data
used in machine-aided translation is provided based on a recent
EC-funded legal review. This is compared against the capabilities
offered by current linked open data standards from the W3C
for publishing and sharing translation memories from translation
projects, and proposals for adequately addressing the intellectual
property needs of stakeholders in translation projects using open
data vocabularies are suggested.
Language Resource Citation: the ISLRN Dissemination and Further Developments
Valérie Mapelli, Vladimir Popescu, Lin Liu and Khalid Choukri
This article presents the latest dissemination activities and
technical developments that were carried out for the International
Standard Language Resource Number (ISLRN) service. It also
recalls the main principle and submission process for providers
to obtain their 13-digit ISLRN identifier. Up to March 2016,
2100 Language Resources were allocated an ISLRN number, not
only ELRA’s and LDC’s catalogued Language Resources, but
also the ones from other important organisations like the Joint
Research Centre (JRC) and the Resource Management Agency
(RMA), which expressed their strong support for this initiative. In
the research field, not only assigning a unique identification
number is important, but also referring to a Language Resource
as an object per se (like publications) has now become an
obvious requirement. The ISLRN could also become an important
parameter to be considered to compute a Language Resource
Impact Factor (LRIF) in order to recognize the merits of the
producers of Language Resources. Integrating the ISLRN number
into a LR-oriented bibliographical reference is thus part of the
objective. The idea is to make use of a BibTeX entry that
would take into account Language Resource items, including
the ISLRN. The ISLRN being a requested field within the LREC 2016
submission form, we expect that several other LRs will be allocated an
ISLRN number by the conference date. With this expansion, this
number aims to become a widely used LR citation instrument within
works referring to LRs.
Trends in HLT Research: A Survey of LDC’s Data Scholarship Program
Denise DiPersio and Christopher Cieri
Since its inception in 2010, the Linguistic Data Consortium’s data
scholarship program has awarded no cost grants in data to 64
recipients from 26 countries. A survey of the twelve cycles to
date – two awards each in the Fall and Spring semesters from
Fall 2010 through Spring 2016 – yields an interesting view into
graduate program research trends in human language technology
and related fields and the particular data sets deemed important to
support that research. The survey also reveals regions in which
such activity appears to be on the rise, including Arabic-speaking
regions and parts of the Americas and Asia.
O18 - Tweet Corpora and Analysis
Thursday, May 26, 9:45
Chairperson: Bernardo Magnini Oral Session
Tweeting and Being Ironic in the Debate about a Political Reform: the French Annotated Corpus TWitter-MariagePourTous
Cristina Bosco, Mirko Lai, Viviana Patti and Daniela Virone
The paper introduces a new annotated French data set for
Sentiment Analysis, a currently missing resource. It comprises
data collected from Twitter on the socio-political debate about
the reform of the marriage law in France.
The design of the annotation scheme is described, which extends
a polarity label set by making available tags for marking target
semantic areas and figurative language devices. The annotation
process is presented and the disagreement discussed, in particular,
in the perspective of figurative language use and in that of the
semantic oriented annotation, which are open challenges for NLP
systems.
Towards a Corpus of Violence Acts in Arabic Social Media
Ayman Alhelbawy, Massimo Poesio and Udo Kruschwitz
In this paper we present a new corpus of Arabic tweets that
mention some form of violent event, developed to support the
automatic identification of Human Rights Abuse. The dataset
was manually labelled for seven classes of violence using
crowdsourcing.
TwiSty: A Multilingual Twitter Stylometry Corpus for Gender and Personality Profiling
Ben Verhoeven, Walter Daelemans and Barbara Plank
Personality profiling is the task of detecting personality traits of
authors based on their writing style. Several personality typologies
exist; however, the Myers-Briggs Type Indicator (MBTI) is
particularly popular in the non-scientific community, and many
people use it to analyse their own personality and talk about the
results online. Therefore, large amounts of self-assessed data
on MBTI are readily available on social-media platforms such
as Twitter. We present a novel corpus of tweets annotated with
the MBTI personality type and gender of their author for six
Western European languages (Dutch, German, French, Italian,
Portuguese and Spanish). We outline the corpus creation and
annotation, show statistics of the obtained data distributions and
present first baselines on Myers-Briggs personality profiling and
gender prediction for all six languages.
Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages
Muhammad Imran, Prasenjit Mitra and Carlos Castillo
Microblogging platforms such as Twitter provide active
communication channels during mass convergence and emergency
events such as earthquakes and typhoons. During the sudden onset
of a crisis, affected people post useful information on Twitter
that can be used for situational awareness and other humanitarian
disaster-response efforts, if processed in a timely and effective
manner. Processing social media information poses multiple
challenges, such as parsing noisy, brief and informal messages,
learning information categories from the incoming stream of
messages, and classifying messages into different classes, among
others. A basic necessity for many of these tasks is the
availability of data, in particular human-annotated data. In
this paper, we present human-annotated Twitter corpora collected
during 19 different crises that took place between 2013 and 2015.
To demonstrate the utility of the annotations, we train machine
learning classifiers. Moreover, we publish the first word2vec
word embeddings trained on a large collection of 52 million
crisis-related tweets. To deal with the linguistic peculiarities
of tweets, we also present human-annotated normalized lexical
resources for different lexical variations.
Functions of Code-Switching in Tweets: An Annotation Framework and Some Initial Experiments
Rafiya Begum, Kalika Bali, Monojit Choudhury, Koustav Rudra and Niloy Ganguly
Code-Switching (CS) between two languages is extremely
common in communities with societal multilingualism where
speakers switch between two or more languages when interacting
with each other. CS has been extensively studied in
spoken language by linguists for several decades, but with the
popularity of social media and less formal Computer Mediated
Communication, we now see a sharp rise in the use of CS in
written text. This poses interesting challenges and a need
for computational processing of such code-switched data. As
with any Computational Linguistic analysis and Natural Language
Processing tools and applications, we need annotated data for
understanding, processing, and generation of code-switched
language. In this study, we focus on CS between English and
Hindi Tweets extracted from the Twitter stream of Hindi-English
bilinguals. We present an annotation scheme for annotating
the pragmatic functions of CS in Hindi-English (Hi-En) code-
switched tweets based on a linguistic analysis and some initial
experiments.
O19 - Dependency Treebanks
Thursday, May 26, 9:45
Chairperson: Simonetta Montemagni Oral Session
Universal Dependencies for Japanese
Takaaki Tanaka, Yusuke Miyao, Masayuki Asahara, Sumire Uematsu, Hiroshi Kanayama, Shinsuke Mori and Yuji Matsumoto
We present an attempt to port the international syntactic
annotation scheme, Universal Dependencies, to the Japanese
language in this paper. Since the Japanese syntactic structure
is usually annotated on the basis of unique chunk-based
dependencies, we first introduce word-based dependencies by
using a word unit called the Short Unit Word, which usually
corresponds to an entry in the lexicon UniDic. Porting is done
by mapping the part-of-speech tagset in UniDic to the universal
part-of-speech tagset, and converting a constituent-based treebank
to a typed dependency tree. The conversion is not straightforward,
and we discuss the problems that arose in the conversion and the
current solutions. A treebank consisting of 10,000 sentences was
built by converting existing resources and has been released to
the public.
Universal Dependencies v1: A Multilingual Treebank Collection
Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty and Daniel Zeman
Cross-linguistically consistent annotation is necessary for sound
comparative evaluation and cross-lingual learning experiments.
It is also useful for multilingual system development and
comparative linguistic studies. Universal Dependencies is an
open community effort to create cross-linguistically consistent
treebank annotation for many languages within a dependency-based
lexicalist framework. In this paper, we describe v1 of the
universal guidelines, the underlying design principles, and the
currently available treebanks for 33 languages.
Construction of an English Dependency Corpus incorporating Compound Function Words
Akihiko Kato, Hiroyuki Shindo and Yuji Matsumoto
The recognition of multiword expressions (MWEs) in a sentence
is important for such linguistic analyses as syntactic and semantic
parsing, because it is known that combining an MWE into a
single token improves accuracy for various NLP tasks, such
as dependency parsing and constituency parsing. However, MWEs
are not annotated in the Penn Treebank. Furthermore, when
converting word-based dependencies to MWE-aware dependencies
directly, one could combine the nodes of an MWE into a single
node. However, this method often leads to the following problem:
a node derived from an MWE could have multiple heads, and the
whole dependency structure including the MWE might be cyclic. We
therefore convert the phrase structure to a dependency structure
after establishing each MWE as a single subtree. This approach
avoids multiple heads and cycles.
In this way, we constructed an English dependency corpus taking
into account compound function words, which are one type of
MWEs that serve as functional expressions. In addition, we report
experimental results of dependency parsing using a constructed
corpus.
Adapting the TANL tool suite to Universal Dependencies
Maria Simi and Giuseppe Attardi
TANL is a suite of tools for text analytics based on the software
architecture paradigm of data driven pipelines. The strategies
for upgrading TANL to the use of Universal Dependencies range
from a minimalistic approach consisting of introducing pre/post-
processing steps into the native pipeline to revising the whole
pipeline. We explore the issue in the context of the Italian
Treebank, considering the effort involved, how to avoid losing
linguistically relevant information, and the loss of accuracy in
the process. In particular, we compare different strategies for
parsing and discuss the implications of simplifying the pipeline
when detailed part-of-speech and morphological annotations are
not available, as is the case for less-resourced languages. The
experiments are relative to the Italian linguistic pipeline, but the
use of different parsers in our evaluations and the avoidance of
language specific tagging make the results general enough to be
useful in helping the transition to UD for other languages.
A Dependency Treebank of the Chinese Buddhist Canon
Tak-sum Wong and John Lee
We present a dependency treebank of the Chinese Buddhist
Canon, which contains 1,514 texts with about 50 million Chinese
characters. The treebank was created by an automatic parser
trained on a smaller treebank, containing four manually annotated
sutras (Lee and Kong, 2014). We report results on word
segmentation, part-of-speech tagging and dependency parsing,
and discuss challenges posed by the processing of medieval
Chinese. In a case study, we exploit the treebank to examine verbs
frequently associated with Buddha, and to analyze usage patterns
of quotative verbs in direct speech. Our results suggest that certain
quotative verbs imply status differences between the speaker and
the listener.
O20 - Word Sense Disambiguation
Thursday, May 26, 9:45
Chairperson: Nancy Ide Oral Session
Automatic Biomedical Term Polysemy Detection
Juan Antonio Lossio-Ventura, Clement Jonquet, Mathieu Roche and Maguelonne Teisseire
Polysemy is the capacity of a word to have multiple meanings.
Polysemy detection is a first step for Word Sense Induction
(WSI), which finds the different meanings of a term, and it is
also important for information extraction (IE) systems and for
building or enriching terminologies and ontologies. In this
paper, we present a novel approach to detecting whether a
biomedical term is polysemous, with the long-term goal of
enriching biomedical ontologies. The approach is based on the
extraction of new features of two kinds: (i) features extracted
directly from the text dataset, and (ii) features extracted from
an induced graph. Our method obtains an accuracy and F-measure of
0.978.
Cro36WSD: A Lexical Sample for Croatian Word Sense Disambiguation
Domagoj Alagic and Jan Šnajder
We introduce Cro36WSD, a freely available medium-sized
lexical sample for Croatian word sense disambiguation (WSD).
Cro36WSD comprises 36 words: 12 adjectives, 12 nouns,
and 12 verbs, balanced across both frequency bands and polysemy
levels. We adopt the multi-label annotation scheme in the hope
of lessening the drawbacks of discrete sense inventories and
obtaining more realistic annotations from human experts. Sense-
annotated data is collected through multiple annotation rounds to
ensure high-quality annotations: with a 115 person-hours effort
we reached an inter-annotator agreement score of 0.877. We
analyze the obtained data and perform a correlation analysis
between several relevant variables, including word frequency,
number of senses, sense distribution skewness, average annotation
time, and the observed inter-annotator agreement (IAA). Using
the obtained data, we compile multi- and single-labeled dataset
variants using different label aggregation schemes. Finally, we
evaluate three different baseline WSD models on both dataset
variants and report on the insights gained. We make both dataset
variants freely available.
Addressing the MFS Bias in WSD systems
Marten Postma, Ruben Izquierdo, Eneko Agirre, German Rigau and Piek Vossen
Word Sense Disambiguation (WSD) systems tend to have a
strong bias towards assigning the Most Frequent Sense (MFS),
which results in high performance on the MFS but in a very
low performance on the less frequent senses. We address the MFS
bias in WSD systems by combining the output of a WSD system with
a set of mostly static features to build an MFS classifier that
decides when to choose the MFS and when not to. The output of
this MFS classifier, which is based on the Random Forest
algorithm, is then used to modify the output of the original WSD
system. We applied our classifier to one of the state-of-the-art
supervised WSD systems, IMS, and to one of the best
state-of-the-art unsupervised WSD systems, UKB. Our main finding is
that we are able to improve the system output in terms of choosing
between the MFS and the less frequent senses. When we apply the
MFS classifier to fine-grained WSD, we observe an improvement
on the less frequent sense cases, whereas we maintain the overall
recall.
A Large-Scale Multilingual Disambiguation of Glosses
José Camacho-Collados, Claudio Delli Bovi, Alessandro Raganato and Roberto Navigli
Linking concepts and named entities to knowledge bases has
become a crucial Natural Language Understanding task. In
this respect, recent works have shown the key advantage
of exploiting textual definitions in various Natural Language
Processing applications. However, to date there are no reliable
large-scale corpora of sense-annotated textual definitions available
to the research community. In this paper we present a large-
scale high-quality corpus of disambiguated glosses in multiple
languages, comprising sense annotations of both concepts and
named entities from a unified sense inventory. Our approach
for the construction and disambiguation of the corpus builds
upon the structure of a large multilingual semantic network
and a state-of-the-art disambiguation system; first, we gather
complementary information of equivalent definitions across
different languages to provide context for disambiguation, and
then we combine it with a semantic similarity-based refinement.
As a result we obtain a multilingual corpus of textual definitions
featuring over 38 million definitions in 263 languages, and we
make it freely available at
http://lcl.uniroma1.it/disambiguated-glosses. Experiments on Open
Information Extraction and Sense Clustering show how two state-
of-the-art approaches improve their performance by integrating
our disambiguated corpus into their pipeline.
Unsupervised Ranked Cross-Lingual Lexical Substitution for Low-Resource Languages
Stefan Ecker, Andrea Horbach and Stefan Thater
We propose an unsupervised system for a variant of cross-lingual
lexical substitution (CLLS) to be used in a reading scenario in
computer-assisted language learning (CALL), in which single-
word translations provided by a dictionary are ranked according
to their appropriateness in context. In contrast to most alternative
systems, ours relies on neither parallel corpora nor machine
translation systems, making it suitable when the language to be
learned is a low-resource language. This is achieved by a graph-based
scoring mechanism which can deal with ambiguous translations
of context words provided by a dictionary. Due to this
decoupling from the source language, we need monolingual
corpus resources only for the target language, i.e. the language
of the translation candidates. We evaluate our approach for
the language pair Norwegian Nynorsk-English on an exploratory
manually annotated gold standard and report promising results.
When running our system on the original SemEval CLLS task,
we rank 6th out of 18 (including 2 baselines and our 2 system
variants) in the best evaluation.
P19 - Discourse (2)
Thursday, May 26, 9:45
Chairperson: Olga Uryupina Poster Session
Information structure in the Potsdam Commentary Corpus: Topics
Manfred Stede and Sara Mamprin
The Potsdam Commentary Corpus is a collection of 175 German
newspaper commentaries annotated on a variety of different
layers. This paper introduces a new layer that covers the linguistic
notion of information-structural topic (not to be confused with
‘topic’ as applied to documents in information retrieval). To our
knowledge, this is the first larger topic-annotated resource for
German (and one of the first for any language). We describe the
annotation guidelines and the annotation process, and the results
of an inter-annotator agreement study, which compare favourably
with related work. The annotated corpus is freely available for
research.
A Corpus of Clinical Practice Guidelines Annotated with the Importance of Recommendations
Jonathon Read, Erik Velldal, Marc Cavazza and Gersende Georg
In this paper we present the Corpus of REcommendation
STrength (CREST), a collection of HTML-formatted clinical
guidelines annotated with the location of recommendations.
Recommendations are labelled with an author-provided indicator
of their strength of importance. As data was drawn from many
disparate authors, we define a unified scheme of importance
labels, and provide a mapping for each guideline. We
demonstrate the utility of the corpus and its annotations in
some initial measurements investigating the type of language
constructions associated with strong and weak recommendations,
and experiments into promising features for recommendation
classification, both with respect to strong and weak labels, and to
all labels of the unified scheme. An error analysis indicates that,
while there is a strong relationship between lexical choices and
strength labels, there can be substantial variance in the choices
made by different authors.
The Methodius Corpus of Rhetorical Discourse Structures and Generated Texts
Amy Isard
Using the Methodius Natural Language Generation (NLG)
System, we have created a corpus which consists of a collection
of generated texts which describe ancient Greek artefacts. Each
text is linked to two representations created as part of the NLG
process. The first is a content plan, which uses rhetorical relations
to describe the high-level discourse structure of the text, and the
second is a logical form describing the syntactic structure, which
is sent to the OpenCCG surface realization module to produce
the final text output. In recent work, White and Howcroft (2015)
have used the SPaRKy restaurant corpus, which contains a similar
combination of texts and representations, for their research on
the induction of rules for the combination of clauses. In the first
instance this corpus will be used to test their algorithms on an
additional domain, and extend their work to include the learning
of referring expression generation rules. As far as we know, the
SPaRKy restaurant corpus is the only existing corpus of this type,
and we hope that the creation of this new corpus in a different
domain will provide a useful resource to the Natural Language
Generation community.
Applying Core Scientific Concepts to Context-Based Citation Recommendation
Daniel Duma, Maria Liakata, Amanda Clare, James Ravenscroft and Ewan Klein
The task of recommending relevant scientific literature for a draft
academic paper has recently received significant interest. In our
effort to ease the discovery of scientific literature and augment
scientific writing, we aim to improve the relevance of results
based on a shallow semantic analysis of the source document
and the potential documents to recommend. We investigate the
utility of automatic argumentative and rhetorical annotation of
documents for this purpose. Specifically, we integrate automatic
Core Scientific Concepts (CoreSC) classification into a prototype
context-based citation recommendation system and investigate its
usefulness to the task. We frame citation recommendation as
an information retrieval task and we use the categories of the
annotation schemes to apply different weights to the similarity
formula. Our results show interesting and consistent correlations
between the type of citation and the type of sentence containing
the relevant information.
SciCorp: A Corpus of English Scientific Articles Annotated for Information Status Analysis
Ina Roesiger
This paper presents SciCorp, a corpus of full-text English
scientific papers of two disciplines, genetics and computational
linguistics. The corpus comprises co-reference and bridging
information as well as information status labels. Since SciCorp
is annotated with both labels and the respective co-referent and
bridging links, we believe it is a valuable resource for NLP
researchers working on scientific articles or on applications such
as co-reference resolution, bridging resolution or information
status classification. The corpus has been reliably annotated
by independent human coders with moderate inter-annotator
agreement (average kappa = 0.71). In total, we have annotated
14 full papers containing 61,045 tokens and marked 8,708 definite
noun phrases. The paper describes in detail the annotation scheme
as well as the resulting corpus. The corpus is available for
download in two different formats: in an offset-based format
and for the co-reference annotations in the widely-used, tabular
CoNLL-2012 format.
Using Lexical and Dependency Features to Disambiguate Discourse Connectives in Hindi
Rohit Jain, Himanshu Sharma and Dipti Sharma
Discourse parsing is a challenging task in NLP and plays a
crucial role in discourse analysis. To enable discourse analysis
for Hindi, the Hindi Discourse Relations Bank was created on a
subset of the Hindi TreeBank. The benefits of a discourse analyzer
for automated discourse analysis, question summarization and
question answering have motivated us to begin work
on a discourse analyzer for Hindi. In this paper, we focus on
discourse connective identification for Hindi. We explore various
available syntactic features for this task. We also explore the
use of dependency tree parses present in the Hindi TreeBank
and study the impact of the same on the performance of the
system. We report that the novel dependency features introduced
have a higher impact on precision, in comparison to the syntactic
features previously used for this task. In addition, we report a high
accuracy of 96% for this task.
Annotating Topic Development in Information Seeking Queries
Marta Andersson, Adnan Ozturel and Silvia Pareti
This paper contributes to the limited body of empirical research in
the domain of discourse structure of information seeking queries.
We describe the development of an annotation schema for coding
topic development in information seeking queries and the initial
observations from a pilot sample of query sessions. The main
idea that we explore is the relationship between constant and
variable discourse entities and their role in tracking changes in the
topic progression. We argue that the topicalized entities remain
stable across development of the discourse and can be identified
by a simple mechanism where anaphora resolution is a precursor.
We also claim that a corpus annotated in this framework can be
used as training data for dialogue management and computational
semantics systems.
Searching in the Penn Discourse Treebank Using the PML-Tree Query
Jirí Mírovský, Lucie Poláková and Jan Štepánek
The PML-Tree Query is a general, powerful and user-friendly
system for querying richly linguistically annotated treebanks.
The paper shows how the PML-Tree Query can be used for
searching for discourse relations in the Penn Discourse Treebank
2.0 mapped onto the syntactic annotation of the Penn Treebank.
The OpenCourseWare Metadiscourse (OCWMD) Corpus
Ghada Alharbi and Thomas Hain
This study describes a new corpus of over 60,000 hand-annotated
metadiscourse acts from 106 OpenCourseWare lectures, from two
different disciplines: Physics and Economics. Metadiscourse is a
set of linguistic expressions that signal different functions in the
discourse. This type of language is hypothesised to be helpful in
finding structure in unstructured text, such as lecture discourse.
A brief summary is provided about the annotation scheme and
labelling procedures, inter-annotator reliability statistics, overall
distributional statistics, a description of auxiliary data that will
be distributed with the corpus, and information relating to how
to obtain the data. The results provide a deeper understanding of
lecture structure and confirm the reliable coding of metadiscursive
acts in academic lectures across different disciplines. The next
stage of our research will be to build a classification model to
automate the tagging process, instead of manual annotation, which
takes time and effort, and to use these tags as indicators of the
higher-level structure of lecture discourse.
Ubuntu-fr: A Large and Open Corpus for Multi-modal Analysis of Online Written Conversations
Nicolas Hernandez, Soufian Salim and Elizaveta Loginova Clouet
We present a large, free, French corpus of online written
conversations extracted from the Ubuntu platform’s forums,
mailing lists and IRC channels. The corpus is meant to
support multi-modality and diachronic studies of online written
conversations. We choose to build the corpus around a robust
metadata model based upon strong principles, such as the “stand
off” annotation principle. We detail the model, explain how the
data was collected and processed - in terms of metadata, text and
conversation - and describe the corpus’ contents through a
series of meaningful statistics. A portion of the corpus - about
4,700 sentences from emails, forum posts and chat messages
sent in November 2014 - is annotated in terms of dialogue acts
and sentiment. We discuss how we adapted our dialogue act
taxonomy from the DIT++ annotation scheme and how the data
was annotated, before presenting our results as well as a brief
qualitative analysis of the annotated data.
DUEL: A Multi-lingual Multimodal Dialogue Corpus for Disfluency, Exclamations and Laughter
Julian Hough, Ye Tian, Laura de Ruiter, Simon Betz, Spyros Kousidis, David Schlangen and Jonathan Ginzburg
We present the DUEL corpus, consisting of 24 hours of natural,
face-to-face, loosely task-directed dialogue in German, French
and Mandarin Chinese. The corpus is uniquely positioned as
a cross-linguistic, multimodal dialogue resource controlled for
domain. DUEL includes audio, video and body tracking data, and
is transcribed and annotated for disfluency, laughter and
exclamations.
P20 - Document Classification and Text Categorisation (1)
Thursday, May 26, 9:45
Chairperson: Fabio Tamburini Poster Session
Character-Level Neural Translation for Multilingual Media Monitoring in the SUMMA Project
Guntis Barzdins, Steve Renals and Didzis Gosko
The paper steps outside the comfort zone of traditional NLP
tasks such as automatic speech recognition (ASR) and machine
translation (MT) to address two novel problems arising in
automated multilingual news monitoring: segmenting TV and radio
programme ASR transcripts into individual stories, and clustering
the individual stories coming from various sources and languages
into storylines. Storyline clustering of stories covering the
same events is an essential task for inquisitorial media
monitoring. We address these two problems
jointly by engaging the low-dimensional semantic representation
capabilities of the sequence to sequence neural translation
models. To enable joint multi-task learning for multilingual
neural translation of morphologically rich languages we replace
the attention mechanism with the sliding-window mechanism
and operate the sequence to sequence neural translation model
on the character-level rather than on the word-level. The
story segmentation and storyline clustering problem is tackled
by examining the low-dimensional vectors produced as a side-
product of the neural translation process. The results of this paper
describe a novel approach to the automatic story segmentation and
storyline clustering problem.
Exploring the Realization of Irony in Twitter Data
Cynthia Van Hee, Els Lefever and Veronique Hoste
Handling figurative language like irony is currently a challenging
task in natural language processing. Since irony is commonly
used in user-generated content, its presence can significantly
undermine accurate analysis of opinions and sentiment in such
texts. Understanding irony is therefore important if we want to
push the state-of-the-art in tasks such as sentiment analysis. In this
research, we present the construction of a Twitter dataset for two
languages, being English and Dutch, and the development of new
guidelines for the annotation of verbal irony in social media texts.
Furthermore, we present some statistics on the annotated corpora,
from which we can conclude that the detection of contrasting
evaluations might be a good indicator for recognizing irony.
Discriminating Similar Languages: Evaluationsand Explorations
Cyril Goutte, Serge Léger, Shervin Malmasi and MarcosZampieri
We present an analysis of the performance of machine learning
classifiers on discriminating between similar languages and
language varieties. We carried out a number of experiments using
the results of the two editions of the Discriminating between
Similar Languages (DSL) shared task. We investigate the progress
made between the two tasks, estimate an upper bound on possible
performance using ensemble and oracle combination, and provide
learning curves to help us understand which languages are more
challenging. A number of difficult sentences are identified and
investigated further with human annotation.
Compilation of an Arabic Children’s Corpus
Latifa Al-Sulaiti, Noorhan Abbas, Claire Brierley, EricAtwell and Ayman Alghamdi
Inspired by the Oxford Children’s Corpus, we have developed
a prototype corpus of Arabic texts written and/or selected for
children. Our Arabic Children’s Corpus of 2950 documents and
nearly 2 million words has been collected manually from the web
during a 3-month project. It is of high quality, and contains a range
of different children’s genres based on sources located, including
classic tales from The Arabian Nights, and popular fictional
characters such as Goha. We anticipate that the current and
subsequent versions of our corpus will lead to interesting studies
in text classification, language use, and ideology in children’s
texts.
Quality Assessment of the Reuters Vol. 2 Multilingual Corpus
Robin Eriksson
We introduce a framework for quality assurance of corpora, and
apply it to the Reuters Multilingual Corpus (RCV2). The results
of this quality assessment of this standard newsprint corpus reveal
a significant duplication problem and, to a lesser extent, a problem
with corrupted articles. From the raw collection of some 487,000
articles, almost one tenth are trivial duplicates. A smaller fraction
of articles appear to be corrupted and should be excluded for
that reason. The detailed results are being made available as
on-line appendices to this article. This effort also demonstrates
the beginnings of a constraint-based methodological framework
for quality assessment and quality assurance for corpora. As
a first implementation of this framework, we have investigated
constraints to verify sample integrity, and to diagnose sample
duplication, entropy aberrations, and tagging inconsistencies. To
help identify near-duplicates in the corpus, we have employed
both entropy measurements and a simple byte bigram incidence
digest.
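A byte bigram incidence digest of the kind mentioned above can be illustrated with a minimal sketch; the function names, example texts, and the Jaccard comparison are assumptions for illustration, not the paper's implementation:

```python
def bigram_digest(text: bytes) -> set:
    """Set of adjacent byte pairs occurring in the text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two digests."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Toy articles: a near-duplicate pair and an unrelated text
doc1 = b"Reuters reports quarterly earnings rose sharply."
doc2 = b"Reuters reports quarterly earnings rose sharply!"
doc3 = b"Completely unrelated weather bulletin for Tuesday."

sim_near = jaccard(bigram_digest(doc1), bigram_digest(doc2))
sim_far = jaccard(bigram_digest(doc1), bigram_digest(doc3))
# near-duplicates share almost all byte bigrams; unrelated texts share few
```

Because the digest is a set of byte pairs, it is cheap to compute and compare, which makes it usable as a coarse pre-filter before more expensive near-duplicate checks.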
Learning Tone and Attribution for Financial Text Mining
Mahmoud El-Haj, Paul Rayson, Steve Young, Andrew Moore, Martin Walker, Thomas Schleicher and Vasiliki Athanasakou
Attribution bias refers to the tendency of people to attribute
successes to their own abilities but failures to external factors. In
a business context an internal factor might be the restructuring of
the firm and an external factor might be an unfavourable change
in exchange or interest rates. In accounting research, the presence
of an attribution bias has been demonstrated for the narrative
sections of annual financial reports. Previous studies have applied manual content analysis to this problem, but only on a small scale; in this paper we present novel work to automate the analysis of attribution bias using machine learning algorithms. In
our work a group of experts in accounting and finance labelled
and annotated a list of 32,449 sentences from a random sample of
UK Preliminary Earning Announcements (PEAs) to allow us to
examine whether sentences in PEAs contain internal or external
attribution and which kinds of attributions are linked to positive
or negative performance. We wished to examine whether human
annotators could agree on coding this difficult task and whether
Machine Learning (ML) could be applied reliably to replicate the
coding process on a much larger scale. Our best machine learning
algorithm correctly classified performance sentences with 70%
accuracy and detected tone and attribution in financial PEAs with an accuracy of 79%.
A Comparative Study of Text Preprocessing Approaches for Topic Detection of User Utterances
Roman Sergienko, Muhammad Shan and Wolfgang Minker
The paper describes a comparative study of existing and novel text
preprocessing and classification techniques for domain detection
of user utterances. Two corpora are considered. The first
one contains customer calls to a call centre for further call
routing; the second one contains answers of call centre employees
with different kinds of customer orientation behaviour. Seven
different unsupervised and supervised term weighting methods
were applied. The collective use of term weighting methods
is proposed for classification effectiveness improvement. Four
different dimensionality reduction methods were applied: stop-
words filtering with stemming, feature selection based on term
weights, feature transformation based on term clustering, and a
novel feature transformation method based on terms belonging to
classes. As classification algorithms we used k-NN and an SVM-based algorithm. The numerical experiments have shown that the simultaneous use of the proposed novel approaches (collectives of term weighting methods and the novel feature transformation method) allows reaching high classification performance with a very small number of features.
UPPC - Urdu Paraphrase Plagiarism Corpus
Muhammad Sharjeel, Paul Rayson and Rao Muhammad Adeel Nawab
Paraphrase plagiarism is a significant and widespread problem
and research shows that it is hard to detect. Several methods and
automatic systems have been proposed to deal with it. However,
evaluation and comparison of such solutions is not possible
because of the unavailability of benchmark corpora with manual
examples of paraphrase plagiarism. To deal with this issue, we
present the novel development of a paraphrase plagiarism corpus
containing simulated (manually created) examples in the Urdu
language - a language widely spoken around the world. This
resource is the first of its kind developed for the Urdu language and
we believe that it will be a valuable contribution to the evaluation
of paraphrase plagiarism detection systems.
Identifying Content Types of Messages Related to Open Source Software Projects
Yannis Korkontzelos, Paul Thompson and Sophia Ananiadou
Assessing the suitability of an Open Source Software project for
adoption requires not only an analysis of aspects related to the
code, such as code quality, frequency of updates and new version
releases, but also an evaluation of the quality of support offered
in related online forums and issue trackers. Understanding the
content types of forum messages and issue trackers can provide
information about the extent to which requests are being addressed
and issues are being resolved, the percentage of issues that are
not being fixed, the cases where the user acknowledged that the
issue was successfully resolved, etc. These indicators can provide
potential adopters of the OSS with estimates about the level of
available support. We present a detailed hierarchy of content
types of online forum messages and issue tracker comments
and a corpus of messages annotated accordingly. We discuss
our experiments to classify forum messages and issue tracker
comments into content-related classes, i.e. to assign them to nodes
of the hierarchy. The results are very encouraging.
Emotion Corpus Construction Based on Selection from Hashtags
Minglei Li, Yunfei Long, Lu Qin and Wenjie Li
The availability of labelled corpus is of great importance for
supervised learning in emotion classification tasks. Because it is
time-consuming to manually label text, hashtags have been used
as naturally annotated labels to obtain a large amount of labelled
training data from microblogs. However, natural hashtags contain too much noise to be used directly in learning algorithms.
In this paper, we design a three-stage semi-automatic method to
construct an emotion corpus from microblogs. Firstly, a lexicon
based voting approach is used to verify the hashtag automatically.
Secondly, a SVM based classifier is used to select the data whose
natural labels are consistent with the predicted labels. Finally,
the remaining data will be manually examined to filter out the
noisy data. Out of about 48K filtered Chinese microblogs, 39K microblogs are selected to form the final corpus, with the Kappa value reaching over 0.92 for the automatic parts and over 0.81 for the manual part. The proportion of automatic selection reaches 54.1%. Thus, the method can reduce about 44.5% of the manual workload for acquiring quality data. Experiments with a classifier trained on this corpus show that it achieves results comparable to those obtained with the manually annotated NLP&CC2013 corpus.
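The first, lexicon-based voting stage described above can be sketched as follows; the lexicon entries, emotion labels, and function names are illustrative assumptions, not the paper's resources:

```python
# Hypothetical emotion lexicon: word -> emotion label (illustrative only)
LEXICON = {
    "happy": "joy", "delighted": "joy", "great": "joy",
    "furious": "anger", "annoyed": "anger",
    "tearful": "sadness", "miserable": "sadness",
}

def lexicon_vote(tokens):
    """Return the emotion label receiving the most lexicon votes, or None."""
    votes = {}
    for tok in tokens:
        label = LEXICON.get(tok)
        if label:
            votes[label] = votes.get(label, 0) + 1
    if not votes:
        return None
    return max(votes, key=votes.get)

def hashtag_verified(tokens, hashtag_label):
    """Keep a post automatically only if the lexicon vote agrees with its hashtag."""
    return lexicon_vote(tokens) == hashtag_label

# A post tagged #joy whose words vote "joy" passes the automatic check
kept = hashtag_verified("so happy and delighted today".split(), "joy")
```

Posts where the vote and the hashtag disagree would then fall through to the later SVM-based and manual filtering stages.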
P21 - Evaluation Methodologies (2)
Thursday, May 26, 9:45
Chairperson: António Branco Poster Session
Comparing the Level of Code-Switching in Corpora
Björn Gambäck and Amitava Das
Social media texts are often fairly informal and conversational,
and when produced by bilinguals tend to be written in
several different languages simultaneously, in the same way as
conversational speech. The recent availability of large social
media corpora has thus also made large-scale code-switched
resources available for research. The paper addresses the issues of
evaluation and comparison these new corpora entail, by defining
an objective measure of corpus level complexity of code-switched
texts. It is also shown how this formal measure can be used in
practice, by applying it to several code-switched corpora.
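The paper defines its own corpus-level complexity measure; as a toy stand-in (explicitly not the authors' metric), a per-utterance switch-point fraction over token-level language tags could look like this:

```python
def switch_fraction(lang_tags):
    """Fraction of adjacent token pairs whose language tag differs.
    0.0 = monolingual utterance; higher values = more frequent switching."""
    if len(lang_tags) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(lang_tags, lang_tags[1:]) if a != b)
    return switches / (len(lang_tags) - 1)

# Token-level language tags for a toy code-switched sentence
tags = ["en", "en", "hi", "hi", "en"]
value = switch_fraction(tags)  # two switches over four adjacent pairs
```

Averaging such a per-utterance statistic over a corpus gives one simple way to compare corpora, which is the kind of comparison the formal measure in the paper is designed for.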
Evaluation of the KIT Lecture Translation System
Markus Müller, Sarah Fünfer, Sebastian Stüker and Alex Waibel
To attract foreign students is among the goals of the Karlsruhe
Institute of Technology (KIT). One obstacle to achieving this goal
is that lectures at KIT are usually held in German, in which many foreign students are not sufficiently proficient, as opposed to, e.g., English. While students from abroad learn German during their stay at KIT, it is challenging to become proficient enough to follow a lecture. As a solution to this
problem we offer our automatic simultaneous lecture translation.
It translates German lectures into English in real time. While
not as good as human interpreters, the system is available at a
price that KIT can afford in order to offer it in potentially all
lectures. In order to assess the quality of the system, we have conducted a user study. In this paper we present this study, the way it was conducted, and its results. The results indicate that the quality of the system has passed the threshold needed to support students in their studies. The study has helped us to identify the most crucial weaknesses of the system and has guided us on which steps to take next.
The ACL RD-TEC 2.0: A Language Resource for Evaluating Term Extraction and Entity Recognition Methods
Behrang QasemiZadeh and Anne-Kathrin Schumann
This paper introduces the ACL Reference Dataset for Terminology
Extraction and Classification, version 2.0 (ACL RD-TEC 2.0).
The ACL RD-TEC 2.0 has been developed with the aim of
providing a benchmark for the evaluation of term and entity
recognition tasks based on specialised text from the computational
linguistics domain. This release of the corpus consists of 300
abstracts from articles in the ACL Anthology Reference Corpus,
published between 1978 and 2006. In these abstracts, terms (i.e.,
single or multi-word lexical units with a specialised meaning)
are manually annotated. In addition to their boundaries in
running text, annotated terms are classified into one of the seven
categories method, tool, language resource (LR), LR product,
model, measures and measurements, and other. To assess the
quality of the annotations and to determine the difficulty of this
annotation task, more than 171 of the abstracts are annotated
twice, independently, by each of the two annotators. In total, 6,818
terms are identified and annotated in more than 1300 sentences,
resulting in a specialised vocabulary made of 3,318 lexical forms,
mapped to 3,471 concepts. We explain the development of the
annotation guidelines and discuss some of the challenges we
encountered in this annotation task.
Building an Arabic Machine Translation Post-Edited Corpus: Guidelines and Annotation
Wajdi Zaghouani, Nizar Habash, Ossama Obeid, Behrang Mohit, Houda Bouamor and Kemal Oflazer
We present our guidelines and annotation procedure to create a human-corrected, machine-translated post-edited corpus for Modern Standard Arabic. Our overarching goal is to use the
annotated corpus to develop automatic machine translation post-
editing systems for Arabic that can be used to help accelerate
the human revision process of translated texts. The creation of
any manually annotated corpus usually presents many challenges.
In order to address these challenges, we created comprehensive
and simplified annotation guidelines which were used by a team
of five annotators and one lead annotator. In order to ensure
a high annotation agreement between the annotators, multiple
training sessions were held and regular inter-annotator agreement
measures were performed to check the annotation quality. The
created corpus of manual post-edited translations of English to
Arabic articles is the largest to date for this language pair.
Tools and Guidelines for Principled Machine Translation Development
Nora Aranberri, Eleftherios Avramidis, Aljoscha Burchardt, Ondrej Klejch, Martin Popel and Maja Popovic
This work addresses the need to aid Machine Translation (MT)
development cycles with a complete workflow of MT evaluation
methods. Our aim is to assess, compare and improve MT
system variants. We hereby report on novel tools and practices
that support various measures, developed in order to support
a principled and informed approach of MT development. Our
toolkit for automatic evaluation showcases quick and detailed
comparison of MT system variants through automatic metrics and
n-gram feedback, along with manual evaluation via edit-distance,
error annotation and task-based feedback.
Generating Task-Pertinent sorted Error Lists for Speech Recognition
Olivier Galibert, Mohamed Ameur Ben Jannet, Juliette Kahn and Sophie Rosset
Automatic Speech Recognition (ASR) is one of the most widely used components in spoken language processing applications. ASR errors are of varying importance with respect to the application, making error analysis key to improving speech processing applications. Knowing which errors are most serious for the applicative case is critical to building better systems. In the
context of Automatic Speech Recognition (ASR) used as a first
step towards Named Entity Recognition (NER) in speech, error
seriousness is usually determined by their frequency, due to the
use of the WER as metric to evaluate the ASR output, despite the
emergence of more relevant measures in the literature. We propose to use a different evaluation metric from the literature in order to classify ASR errors according to their seriousness for NER. Our results show that the importance of ASR errors is ranked differently depending on the evaluation metric used. A more detailed analysis
shows that the estimation of the error impact given by the ATENE
metric is more adapted to the NER task than the estimation based
only on the most used frequency metric WER.
P22 - Information Extraction and Retrieval (2)
Thursday, May 26, 9:45
Chairperson: Robert Gaizauskas Poster Session
A Study of Reuse and Plagiarism in LREC papers
Gil Francopoulo, Joseph Mariani and Patrick Paroubek
The aim of this experiment is to present an easy way to compare
fragments of texts in order to detect (supposed) results of copy
& paste operations between articles in the domain of Natural
Language Processing (NLP). The search space of the comparisons
is a corpus labeled as NLP4NLP gathering a large part of the NLP
field. The study is centered on LREC papers in both directions,
first with an LREC paper borrowing a fragment of text from the
collection, and secondly in the reverse direction with fragments of
LREC documents borrowed and inserted in the collection.
Developing a Dataset for Evaluating Approaches for Document Expansion with Images
Debasis Ganguly, Iacer Calixto and Gareth Jones
Motivated by the adage that a “picture is worth a thousand words”
it can be reasoned that automatically enriching the textual content
of a document with relevant images can increase the readability
of a document. Moreover, features extracted from the additional
image data inserted into the textual content of a document may, in
principle, also be used by a retrieval engine to better match the
topic of a document with that of a given query. In this paper, we
describe our approach of building a ground truth dataset to enable
further research into automatic addition of relevant images to text
documents. The dataset is comprised of the official ImageCLEF
2010 collection (a collection of images with textual metadata) to
serve as the images available for automatic enrichment of text, a
set of 25 benchmark documents that are to be enriched, which in
this case are children’s short stories, and a set of manually judged
relevant images for each query story obtained by the standard
procedure of depth pooling. We use this benchmark dataset
to evaluate the effectiveness of standard information retrieval
methods as simple baselines for this task. The results indicate that
using the whole story as a weighted query, where the weight of each query term is its tf-idf value, achieves a precision of 0.1714 within the top 5 retrieved images on average.
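The tf-idf-weighted-query baseline described above can be sketched minimally; the toy image collection, tokenisation, and idf smoothing below are illustrative assumptions, not the paper's setup:

```python
import math
from collections import Counter

def tfidf_weights(query_tokens, doc_freq, n_docs):
    """Weight each query term by tf * idf over the metadata collection."""
    tf = Counter(query_tokens)
    return {t: tf[t] * math.log(n_docs / (1 + doc_freq.get(t, 0)))
            for t in tf}

def score(weights, metadata_tokens):
    """Sum the query weights of terms occurring in an image's metadata."""
    meta = set(metadata_tokens)
    return sum(w for t, w in weights.items() if t in meta)

# Toy collection of image metadata (illustrative)
images = {
    "img1": "dragon castle knight".split(),
    "img2": "stock market chart".split(),
    "img3": "sunset beach palm".split(),
}
doc_freq = Counter(t for toks in images.values() for t in set(toks))
story = "the knight rode to the castle to fight the dragon".split()
weights = tfidf_weights(story, doc_freq, n_docs=len(images))
ranked = sorted(images, key=lambda i: score(weights, images[i]), reverse=True)
```

High-frequency story words like "the" receive weight but match no image metadata, so the ranking is driven by the content terms, which is the intended effect of using the whole story as a weighted query.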
More than Word Cooccurrence: Exploring Support and Opposition in International Climate Negotiations with Semantic Parsing
Pablo Ruiz, Clément Plancq and Thierry Poibeau
Text analysis methods widely used in digital humanities often
involve word co-occurrence, e.g. concept co-occurrence
networks. These methods provide a useful corpus overview,
but cannot determine the predicates that relate co-occurring
concepts. Our goal was identifying propositions expressing
the points supported or opposed by participants in international
climate negotiations. Word co-occurrence methods were not
sufficient, and an analysis based on open relation extraction had
limited coverage for nominal predicates. We present a pipeline
which identifies the points that different actors support and
oppose, via a domain model with support/opposition predicates,
and analysis rules that exploit the output of semantic role
labelling, syntactic dependencies and anaphora resolution. Entity
linking and keyphrase extraction are also performed on the
propositions related to each actor. A user interface allows
examining the main concepts in points supported or opposed
by each participant, which participants agree or disagree with
each other, and about which issues. The system is an example
of tools that digital humanities scholars are asking for, to
render rich textual information (beyond word co-occurrence) more
amenable to quantitative treatment. An evaluation of the tool was
satisfactory.
A Sequence Model Approach to Relation Extraction in Portuguese
Sandra Collovini, Gabriel Machado and Renata Vieira
The task of Relation Extraction from texts is one of the main
challenges in the area of Information Extraction, considering
the required linguistic knowledge and the sophistication of the
language processing techniques employed. This task aims
at identifying and classifying semantic relations that occur
between entities recognized in a given text. In this paper,
we evaluated a Conditional Random Fields classifier for the
extraction of any relation descriptor occurring between named
entities (Organisation, Person and Place categories), as well as
pre-defined relation types between these entities in Portuguese
texts.
Evaluation Set for Slovak News Information Retrieval
Daniel Hládek, Ján Staš and Jozef Juhár
This work proposes an information retrieval evaluation set for
the Slovak language. A set of 80 queries written in the natural
language is given together with the set of relevant documents.
The document set contains 3980 newspaper articles sorted into 6
categories. Each document in the result set is manually annotated
for relevancy with its corresponding query. The evaluation set
is mostly compatible with the Cranfield test collection using the
same methodology for queries and annotation of relevancy. In
addition to that it provides annotation for document title, author,
publication date and category that can be used for evaluation of
automatic document clustering and categorization.
Analyzing Time Series Changes of Correlation between Market Share and Concerns on Companies measured through Search Engine Suggests
Takakazu Imada, Yusuke Inoue, Lei Chen, Syunya Doi, Tian Nie, Chen Zhao, Takehito Utsuro and Yasuhide Kawada
This paper proposes how to utilize a search engine in order to
predict market shares. We propose to compare rates of concerns of
those who search for Web pages among several companies which
supply products, given a specific products domain. We measure
concerns of those who search for Web pages through search
engine suggests. Then, we analyze whether rates of concerns
of those who search for Web pages have certain correlation with
actual market share. We show that those statistics have certain
correlations. We finally propose how to predict the market share
of a specific product genre based on the rates of concerns of those
who search for Web pages.
TermITH-Eval: a French Standard-Based Resource for Keyphrase Extraction Evaluation
Adrien Bougouin, Sabine Barreaux, Laurent Romary, Florian Boudin and Beatrice Daille
Keyphrase extraction is the task of finding phrases that represent
the important content of a document. The main aim of keyphrase
extraction is to propose textual units that represent the most
important topics developed in a document. The output keyphrases
of automatic keyphrase extraction methods for test documents
are typically evaluated by comparing them to manually assigned
reference keyphrases. Each output keyphrase is considered
correct if it matches one of the reference keyphrases. However,
the choice of the appropriate textual unit (keyphrase) for a
topic is sometimes subjective and evaluating by exact matching
underestimates the performance. This paper presents a dataset of
evaluation scores assigned to automatically extracted keyphrases
by human evaluators. Along with the reference keyphrases,
the manual evaluations can be used to validate new evaluation
measures. Indeed, an evaluation measure that is highly correlated with the manual evaluation is appropriate for the evaluation of automatic keyphrase extraction methods.
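The exact-matching evaluation that the paper argues underestimates performance can be sketched as follows; the normalisation (lowercasing only) and example keyphrases are illustrative assumptions:

```python
def exact_match_prf(output, reference):
    """Precision/recall/F1 of extracted keyphrases by exact string match
    after simple normalisation (lowercasing)."""
    out = {k.lower() for k in output}
    ref = {k.lower() for k in reference}
    correct = len(out & ref)
    p = correct / len(out) if out else 0.0
    r = correct / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# "evaluation measure" gets no credit for "reference keyphrases" even though
# a human judge might accept it as covering the same topic
p, r, f = exact_match_prf(
    ["keyphrase extraction", "evaluation measure"],
    ["Keyphrase Extraction", "reference keyphrases"])
```

The second output keyphrase scores zero under exact matching despite being topically close, which is precisely the underestimation the manually assigned evaluation scores in the dataset are meant to expose.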
The Royal Society Corpus: From Uncharted Data to Corpus
Hannah Kermes, Stefania Degaetano-Ortlieb, Ashraf Khamis, Jörg Knappen and Elke Teich
We present the Royal Society Corpus (RSC) built from the
Philosophical Transactions and Proceedings of the Royal Society
of London. At present, the corpus contains articles from the
first two centuries of the journal (1665–1869) and amounts to
around 35 million tokens. The motivation for building the RSC
is to investigate the diachronic linguistic development of scientific
English. Specifically, we assume that due to specialization,
linguistic encodings become more compact over time (Halliday,
1988; Halliday and Martin, 1993), thus creating a specific
discourse type characterized by high information density that is
functional for expert communication. When building corpora
from uncharted material, typically not all relevant meta-data
(e.g. author, time, genre) or linguistic data (e.g. sentence/word
boundaries, words, parts of speech) is readily available. We
present an approach to obtain good quality meta-data and base
text data adopting the concept of Agile Software Development.
Building Evaluation Datasets for Consumer-Oriented Information Retrieval
Lorraine Goeuriot, Liadh Kelly, Guido Zuccon and Joao Palotti
Common people often experience difficulties in accessing
relevant, correct, accurate and understandable health information
online. Developing search techniques that aid these information
needs is challenging. In this paper we present the datasets created
by the CLEF eHealth Lab from 2013 to 2015 for evaluation of search
solutions to support common people finding health information
online. Specifically, the CLEF eHealth information retrieval
(IR) task of this Lab has provided the research community with
benchmarks for evaluating consumer-centered health information
retrieval, thus fostering research and development aimed to
address this challenging problem. Given consumer queries, the
goal of the task is to retrieve relevant documents from the provided
collection of web pages. The shared datasets provide a large health
web crawl, queries representing people’s real world information
needs, and relevance assessment judgements for the queries.
A Dataset for Open Event Extraction in English
Kiem-Hieu Nguyen, Xavier Tannier, Olivier Ferret and Romaric Besançon
This article presents a corpus for development and testing of event
schema induction systems in English. Schema induction is the task of learning templates from unlabeled texts with no supervision, and of grouping together entities corresponding to the same role in a template. Most of the previous work on this subject relies
on the MUC-4 corpus. We describe the limits of using this corpus
(size, non-representativeness, similarity of roles across templates)
and propose a new, partially-annotated corpus in English which
remedies some of these shortcomings. We make use of Wikinews
to select the data inside the category Laws & Justice, and query the Google search engine to retrieve different documents on the
same events. Only Wikinews documents are manually annotated
and can be used for evaluation, while the others can be used
for unsupervised learning. We detail the methodology used for
building the corpus and evaluate some existing systems on this
new data.
P23 - Prosody and Phonology
Thursday, May 26, 9:45
Chairperson: Björn Schuller Poster Session
Phoneme Alignment Using the Information on Phonological Processes in Continuous Speech
Daniil Kocharov
The current study focuses on optimization of the Levenshtein algorithm for the purpose of computing the optimal alignment between two phoneme transcriptions of a spoken utterance, each containing a sequence of phonetic symbols. The alignment is
computed with the help of a confusion matrix in which costs for
phonetic symbol deletion, insertion and substitution are defined
taking into account various phonological processes that occur in
fluent speech, such as anticipatory assimilation, phone elision and
epenthesis. The corpus containing about 30 hours of Russian
read speech was used to evaluate the presented algorithms.
The experimental results have shown a significant reduction of the misalignment rate in comparison with the baseline Levenshtein algorithm: the number of errors has been reduced from 1.1% to 0.28%.
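A minimal sketch of a Levenshtein alignment whose substitution costs come from a confusion matrix, as described above; the toy cost table below is an illustrative assumption, not the paper's phonologically derived costs:

```python
def align_cost(ref, hyp, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Minimal edit cost between two phoneme sequences, with substitution
    costs looked up in a confusion matrix (dict keyed by symbol pairs)."""
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if ref[i - 1] == hyp[j - 1] else sub_cost.get(
                (ref[i - 1], hyp[j - 1]), 1.0)
            d[i][j] = min(d[i - 1][j] + del_cost,
                          d[i][j - 1] + ins_cost,
                          d[i - 1][j - 1] + sub)
    return d[n][m]

# Toy confusion matrix: assimilation-like voicing substitutions are cheap
costs = {("t", "d"): 0.2, ("s", "z"): 0.2}
cost = align_cost(["k", "a", "t"], ["k", "a", "d"], costs)
```

Making phonologically plausible substitutions cheaper than the default cost of 1.0 is what lets the alignment prefer a t/d substitution over a deletion plus insertion, which is the intuition behind the optimization described in the abstract.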
CoRuSS - a New Prosodically Annotated Corpus of Russian Spontaneous Speech
Tatiana Kachkovskaia, Daniil Kocharov, Pavel Skrelin and Nina Volskaya
This paper describes speech data recording, processing and
annotation of a new speech corpus CoRuSS (Corpus of
Russian Spontaneous Speech), which is based on connected
communicative speech recorded from 60 native Russian male and
female speakers of different age groups (from 16 to 77). Some
Russian speech corpora available at the moment contain plain
orthographic texts and provide some kind of limited annotation,
but there are no corpora providing detailed prosodic annotation of
spontaneous conversational speech. This corpus contains 30 hours
of high-quality recordings of spontaneous Russian speech, half of which has been transcribed and prosodically labeled. The recordings consist
of dialogues between two speakers, monologues (speakers’ self-
presentations) and reading of a short phonetically balanced text.
Since the corpus is labeled for a wide range of linguistic - phonetic
and prosodic - information, it provides basis for empirical
studies of various spontaneous speech phenomena as well as for
comparison with those we observe in prepared read speech. Since
the corpus is designed as an open-access resource of speech data, it will also make it possible to advance corpus-based analysis of spontaneous speech data across languages, as well as speech technology development.
Defining and Counting Phonological Classes in Cross-linguistic Segment Databases
Dan Dediu and Scott Moisik
Recently, there has been an explosion in the availability of large,
good-quality cross-linguistic databases such as WALS (Dryer &
Haspelmath, 2013), Glottolog (Hammarstrom et al., 2015) and
Phoible (Moran & McCloy, 2014). Databases such as Phoible
contain the actual segments used by various languages as they
are given in the primary language descriptions. However, this
segment-level representation cannot be used directly for analyses
that require generalizations over classes of segments that share
theoretically interesting features. Here we present a method
and the associated R (R Core Team, 2014) code that allows the
flexible definition of such meaningful classes and that can identify
the sets of segments falling into such a class for any language
inventory. The method and its results are important for those
interested in exploring cross-linguistic patterns of phonetic and
phonological diversity and their relationship to extra-linguistic
factors and processes such as climate, economics, history or
human genetics.
Introducing the SEA_AP: an Enhanced Tool for Automatic Prosodic Analysis
Marta Martinez, Rocio Varela, Carmen García Mateo, Elisa Fernandez Rei and Adela Martinez Calvo
The SEA_AP (Segmentador e Etiquetador Automático para Análise Prosódica, Automatic Segmentation and Labelling for Prosodic Analysis) toolkit is an application that performs audio
segmentation and labelling to create a TextGrid file which will be
used to launch a prosodic analysis using Praat. In this paper, we
want to describe the improved functionality of the tool achieved
by adding a dialectometric analysis module using R scripts. The
dialectometric analysis includes computing correlations among
F0 curves and it obtains prosodic distances among the different
variables of interest (location, speaker, structure, etc.). The
dialectometric analysis requires large databases in order to be
adequately computed, and automatic segmentation and labelling
can create them thanks to a procedure less costly than the manual
alternative. Thus, the integration of these tools into the SEA_AP allows us to propose a distribution of geoprosodic areas by
means of a quantitative method, which completes the traditional
dialectological point of view. The current version of the SEA_AP toolkit is capable of analysing Galician, Spanish and Brazilian
Portuguese data, and hence the distances between several prosodic
linguistic varieties can be measured at present.
A Machine Learning based Music Retrieval and Recommendation System
Naziba Mostafa, Yan Wan, Unnayan Amitabh and Pascale Fung
In this paper, we present a music retrieval and recommendation
system using machine learning techniques. We propose a query
by humming system for music retrieval that uses deep neural
networks for note transcription and a note-based retrieval system
for retrieving the correct song from the database. We evaluate
our query by humming system using the standard MIREX QBSH
dataset. We also propose a similar artist recommendation system
which recommends similar artists based on acoustic features of
the artists’ music, online text descriptions of the artists and social
media data. We use supervised machine learning techniques over
all our features and compare our recommendation results to those
produced by a popular similar artist recommendation website.
P24 - Speech Processing (1)
Thursday, May 26, 9:45
Chairperson: Andrew Caines Poster Session
CirdoX: an on/off-line multisource speech and sound analysis software
Frédéric Aman, Michel Vacher, François Portet, William Duclot and Benjamin Lecouteux
Vocal User Interfaces in domestic environments recently gained
interest in the speech processing community. This interest is
due to the opportunity of using it in the framework of Ambient
Assisted Living both for home automation (vocal command) and
for calls for help in case of distress situations, i.e. after a fall. CirdoX, which is a modular piece of software, is able to analyse online the audio environment in a home, to extract the uttered sentences and then to process them with an ASR module. Moreover, this system performs non-speech audio event classification; in this
case, specific models must be trained. The software is designed to
be modular and to process on-line the audio multichannel stream.
Some examples of studies in which CirdoX was involved are described. They were operated in a real environment, namely a living-lab environment.
Optimizing Computer-Assisted Transcription Quality with Iterative User Interfaces
Matthias Sperber, Graham Neubig, Satoshi Nakamura and Alex Waibel
Computer-assisted transcription promises high-quality speech
transcription at reduced costs. This is achieved by limiting human
effort to transcribing parts for which automatic transcription
quality is insufficient. Our goal is to improve the human
transcription quality via appropriate user interface design. We
focus on iterative interfaces that allow humans to solve tasks
based on an initially given suggestion, in this case an automatic
transcription. We conduct a user study that reveals considerable
quality gains for three variations of iterative interfaces over
a non-iterative from-scratch transcription interface. Our
iterative interfaces included post-editing, confidence-enhanced
post-editing, and a novel retyping interface. All three yielded
similar quality on average, but we found that the proposed
retyping interface was less sensitive to the difficulty of the
segment, and superior when the automatic transcription of the
segment contained relatively many errors. An analysis using
mixed-effects models allows us to quantify these and other factors
and draw conclusions over which interface design should be
chosen in which circumstance.
A Framework for Collecting Realistic Recordings of Dysarthric Speech - the homeService Corpus
Mauro Nicolao, Heidi Christensen, Stuart Cunningham, Phil Green and Thomas Hain
This paper introduces a new British English speech database,
named the homeService corpus, which has been gathered as
part of the homeService project. This project aims to help
users with speech and motor disabilities to operate their home
appliances using voice commands. The audio recorded during
such interactions consists of realistic data of speakers with severe
dysarthria. The majority of the homeService corpus is recorded
in real home environments where voice control is often the
normal means by which users interact with their devices. The
collection of the corpus is motivated by the shortage of realistic
dysarthric speech corpora available to the scientific community.
Along with the details on how the data is organised and how it
can be accessed, a brief description of the framework used to
make the recordings is provided. Finally, the performance of the
homeService automatic recogniser for dysarthric speech trained
with single-speaker data from the corpus is provided as an initial
baseline. Access to the homeService corpus is provided through
the dedicated web page at http://mini.dcs.shef.ac.uk/resources/homeservice-corpus/,
which will also carry the most up-to-date description of the data.
At the time of writing, the collection process is still ongoing.
Automatic Anomaly Detection for Dysarthria across Two Speech Styles: Read vs Spontaneous Speech
Imed Laaridh, Corinne Fredouille and Christine Meunier
Perceptive evaluation of speech disorders is still the standard
method in clinical practice for diagnosing and following the
progression of a patient's condition. Such evaluation involves
different tasks such as read speech, spontaneous speech, isolated
words, sustained vowels, etc. In this context, automatic
speech processing tools have proven their pertinence in speech quality
evaluation and assistive technology-based applications. However,
very few studies have investigated the use of automatic tools
on spontaneous speech. This paper investigates the behavior
of an automatic phone-based anomaly detection system when
applied on read and spontaneous French dysarthric speech. The
behavior of the automatic tool reveals interesting inter-pathology
differences across speech styles.
TTS for Low Resource Languages: A Bangla Synthesizer
Alexander Gutkin, Linne Ha, Martin Jansche, Knot Pipatsrisawat and Richard Sproat
We present a text-to-speech (TTS) system designed for the dialect
of Bengali spoken in Bangladesh. This work is part of an
ongoing effort to address the needs of under-resourced languages.
We propose a process for streamlining the bootstrapping of
TTS systems for under-resourced languages. First, we use
crowdsourcing to collect data from multiple ordinary speakers,
each recording a small number of sentences. Second,
we leverage an existing text normalization system for a related
language (Hindi) to bootstrap a linguistic front-end for Bangla.
Third, we employ statistical techniques to construct multi-speaker
acoustic models using Long Short-Term Memory Recurrent
Neural Network (LSTM-RNN) and Hidden Markov Model
(HMM) approaches. We then describe our experiments that show
that the resulting TTS voices score well in terms of their perceived
quality as measured by Mean Opinion Score (MOS) evaluations.
Speech Trax: A Bottom to the Top Approach for Speaker Tracking and Indexing in an Archiving Context
Félicien Vallet, Jim Uro, Jérémy Andriamakaoly, Hakim Nabi, Mathieu Derval and Jean Carrive
With the increasing amount of audiovisual and digital data
derived from television and radio sources, professional
archives such as INA, France’s national audiovisual institute,
acknowledge a growing need for efficient indexing tools. In
this paper, we describe the Speech Trax system that aims at
analyzing the audio content of TV and radio documents. In
particular, we focus on the speaker tracking task that is very
valuable for indexing purposes. First, we detail the overall
architecture of the system and show the results obtained on a
large-scale experiment, the largest to our knowledge for this type
of content (about 1,300 speakers). Then, we present the Speech
Trax demonstrator that gathers the results of various automatic
speech processing techniques on top of our speaker tracking
system (speaker diarization, speech transcription, etc.). Finally,
we provide insight into the performance obtained and suggest hints
for future improvements.
O21 - Social Media
Thursday, May 26, 11:45
Chairperson: Piek Vossen Oral Session
Web Chat Conversations from Contact Centers: a Descriptive Study
Geraldine Damnati, Aleksandra Guerraz and Delphine Charlet
In this article we propose a descriptive study of a corpus of
chat conversations from an assistance contact center.
Conversations are described from several viewpoints, including
interaction analysis, language deviation analysis and typographic
expressivity marks analysis. We provide in particular a detailed
analysis of language deviations that are encountered in our corpus
of 230 conversations, corresponding to 6879 messages and 76839
words. These deviations may be challenging for further syntactic
and semantic parsing. Analysis is performed with a distinction
between Customer messages and Agent messages. Overall,
only 4% of the observed words are misspelled, but 26%
of the messages contain at least one erroneous word (rising to
40% for Customer messages). Transcriptions of
telephone conversations from an assistance call center are also
studied, allowing comparisons between these two interaction
modes to be drawn. The study reveals significant differences
in terms of conversation flow, with an increased efficiency for
chat conversations in spite of a longer temporal span.
Identification of Drug-Related Medical Conditions in Social Media
François Morlane-Hondère, Cyril Grouin and Pierre Zweigenbaum
Monitoring social media has been shown to be an interesting
approach for the early detection of drug adverse effects. In
this paper, we describe a system which extracts medical entities
in French drug reviews written by users. We focus on the
identification of medical conditions, which is based on the concept
of post-coordination: we first extract minimal medical-related
entities (pain, stomach), then combine them to identify complex
ones (It was the worst [pain I ever felt in my stomach]). These
two steps are respectively performed by two classifiers, the first
being based on Conditional Random Fields and the second one
on Support Vector Machines. The overall results of the minimal
entity classifier are the following: P=0.926; R=0.849; F1=0.886.
A thorough analysis of the feature set shows that, when
combined with word lemmas, clusters generated by word2vec are
the most valuable features. When trained on the output of the first
classifier, the second classifier’s performances are the following:
P=0.683; R=0.956; F1=0.797. The addition of post-processing rules
did not add any significant global improvement but was found to
modify the precision/recall ratio.
Creating a Lexicon of Bavarian Dialect by Means of Facebook Language Data and Crowdsourcing
Manuel Burghardt, Daniel Granvogl and Christian Wolff
Data acquisition in dialectology is typically a tedious task, as
dialect samples of spoken language have to be collected via
questionnaires or interviews. In this article, we suggest using
the “web as a corpus” approach for dialectology. We present a
case study that demonstrates how authentic language data for the
Bavarian dialect (ISO 639-3:bar) can be collected automatically
from the social network Facebook. We also show that Facebook
can be used effectively as a crowdsourcing platform, where users
are willing to translate dialect words collaboratively in order to
create a common lexicon of their Bavarian dialect. Key insights
from the case study are summarized as “lessons learned” together
with suggestions for future enhancements of the lexicon creation
approach.
A Corpus of Wikipedia Discussions: Over the Years, with Topic, Power and Gender Labels
Vinodkumar Prabhakaran and Owen Rambow
In order to gain a deep understanding of how social context
manifests in interactions, we need data that represents interactions
from a large community of people over a long period of time,
capturing different aspects of social context. In this paper, we
present a large corpus of Wikipedia Talk page discussions that
are collected from a broad range of topics, containing discussions
that happened over a period of 15 years. The dataset contains
166,322 discussion threads, across 1236 articles/topics that span
15 different topic categories or domains. The dataset also captures
whether the post is made by a registered user or not, and
whether he/she was an administrator at the time of making the
post. It also captures the Wikipedia age of editors in terms of
number of months spent as an editor, as well as their gender.
This corpus will be a valuable resource to investigate a variety
of computational sociolinguistics research questions regarding
online social interactions.
O22 - Anaphora and Coreference
Thursday, May 26, 11:45
Chairperson: Eva Hajicová Oral Session
Phrase Detectives Corpus 1.0: Crowdsourced Anaphoric Coreference
Jon Chamberlain, Massimo Poesio and Udo Kruschwitz
Natural Language Engineering tasks require large and complex
annotated datasets to build more advanced models of language.
Corpora are typically annotated by several experts to create
a gold standard; however, there are now compelling reasons
to use a non-expert crowd to annotate text, driven by cost,
speed and scalability. Phrase Detectives Corpus 1.0 is an
anaphorically-annotated corpus of encyclopedic and narrative text
that contains a gold standard created by multiple experts, as well
as a set of annotations created by a large non-expert crowd.
Analysis shows very good inter-expert agreement (kappa=.88-.93)
but a more variable baseline crowd agreement (kappa=.52-.96).
Encyclopedic texts show less agreement (and by implication are
harder to annotate) than narrative texts. The release of this corpus
is intended to encourage research into the use of crowds for text
annotation and the development of more advanced, probabilistic
language models, in particular for anaphoric coreference.
Summ-it++: an Enriched Version of the Summ-it Corpus
Evandro Fonseca, André Antonitsch, Sandra Collovini, Daniela Amaral, Renata Vieira and Anny Figueira
This paper presents Summ-it++, an enriched version of the
Summ-it corpus. In this new version, the corpus has received new
semantic layers, named entity categories and relations between
named entities, adding to the previous coreference annotation. In
addition, we change the original Summ-it format to SemEval.
Towards Multiple Antecedent Coreference Resolution in Specialized Discourse
Alicia Burga, Sergio Cajal, Joan Codina-Filba and Leo Wanner
Despite the popularity of coreference resolution as a research
topic, the overwhelming majority of the work in this area has so
far focused on single-antecedent coreference only. Multiple antecedent
coreference (MAC) has been largely neglected. This can be
explained by the scarcity of the phenomenon of MAC in generic
discourse. However, in specialized discourse such as patents,
MAC is very dominant. It seems thus unavoidable to address
the problem of MAC resolution in the context of tasks related
to automatic patent material processing, among them abstractive
summarization, deep parsing of patents, construction of concept
maps of the inventions, etc. We present the first version of
an operational rule-based MAC resolution strategy for patent
material that covers the three major types of MAC: (i) nominal
MAC, (ii) MAC with personal/relative pronouns, and (iii) MAC with
reflexive/reciprocal pronouns. The evaluation shows that our
strategy performs well in terms of precision and recall.
ARRAU: Linguistically-Motivated Annotation of Anaphoric Descriptions
Olga Uryupina, Ron Artstein, Antonella Bristot, Federica Cavicchio, Kepa Rodriguez and Massimo Poesio
This paper presents a second release of the ARRAU dataset:
a multi-domain corpus with thorough linguistically motivated
annotation of anaphora and related phenomena. Building upon
the first release almost a decade ago, considerable effort has
been invested in improving the data both quantitatively and
qualitatively. Thus, we have doubled the corpus size, expanded
the selection of covered phenomena to include referentiality and
genericity and designed and implemented a methodology for
enforcing the consistency of the manual annotation. We believe
that the new release of ARRAU provides valuable material for
ongoing research in complex cases of coreference as well as for a
variety of related tasks. The corpus is publicly available through
LDC.
O23 - Machine Learning and Information Extraction
Thursday, May 26, 11:45
Chairperson: Feiyu Xu Oral Session
An Annotated Corpus and Method for Analysis ofAd-Hoc Structures Embedded in Text
Eric Yeh, John Niekrasz, Dayne Freitag and Richard Rohwer
We describe a method for identifying and performing functional
analysis of structured regions that are embedded in natural
language documents, such as tables or key-value lists. Such
regions often encode information according to ad hoc schemas
and avail themselves of visual cues in place of natural language
grammar, presenting problems for standard information extraction
algorithms. Unlike previous work in table extraction, which
assumes a relatively noiseless two-dimensional layout, our
aim is to accommodate a wide variety of naturally occurring
structure types. Our approach has three main parts. First, we
collect and annotate a diverse sample of “naturally” occurring
structures from several sources. Second, we use probabilistic text
segmentation techniques, featurized by skip bigrams over spatial
and token category cues, to automatically identify contiguous
regions of structured text that share a common schema. Finally,
we identify the records and fields within each structured region
using a combination of distributional similarity and sequence
alignment methods, guided by minimal supervision in the form
of a single annotated record. We evaluate the last two components
individually, and conclude with a discussion of further work.
Learning Thesaurus Relations from DistributionalFeatures
Rosa Tsegaye Aga, Christian Wartena, Lucas Drumond andLars Schmidt-Thieme
In distributional semantics words are represented by aggregated
context features. The similarity of words can be computed by
comparing their feature vectors. Thus, we can predict whether
two words are synonymous or similar with respect to some
other semantic relation. We will show on six different datasets
of pairs of similar and non-similar words that a supervised
learning algorithm on feature vectors representing pairs of words
outperforms cosine similarity between vectors representing single
words. We compared different methods to construct a feature
vector representing a pair of words. We show that simple methods
like pairwise addition or multiplication give better results than
a recently proposed method that combines different types of
features. The semantic relation we consider is relatedness of
terms in thesauri for intellectual document classification. Thus our
findings can directly be applied for the maintenance and extension
of such thesauri. To the best of our knowledge this relation was
not considered before in the field of distributional semantics.
Factuality Annotation and Learning in Spanish Texts
Dina Wonsever, Aiala Rosá and Marisa Malcuori
We present a proposal for the annotation of the factuality of event
mentions in Spanish texts, together with a freely available annotated
corpus. Our factuality model aims to capture a pragmatic notion of
factuality, trying to reflect a casual reader's judgements about
the realis/irrealis status of mentioned events. Some learning
experiments (SVM and CRF) have also been carried out, showing
encouraging results.
NNBlocks: A Deep Learning Framework for Computational Linguistics Neural Network Models
Frederico Tommasi Caroli, André Freitas, João Carlos Pereira da Silva and Siegfried Handschuh
Lately, with the success of Deep Learning techniques in some
computational linguistics tasks, many researchers want to explore
new models for their linguistics applications. These models tend
to be very different from what standard Neural Networks look
like, limiting the possibility of using standard Neural Network
frameworks. This work presents NNBlocks, a new framework
written in Python to build and train Neural Networks that are not
constrained by a specific kind of architecture, making it possible
to use it in computational linguistics.
O24 - Speech Corpus for Health
Thursday, May 26, 11:45
Chairperson: Eleni Efthimiou Oral Session
Automatic identification of Mild Cognitive Impairment through the analysis of Italian spontaneous speech productions
Daniela Beltrami, Laura Calzà, Gloria Gagliardi, Enrico Ghidoni, Norina Marcello, Rema Rossini Favretti and Fabio Tamburini
This paper presents some preliminary results of the OPLON
project. It aimed at identifying early linguistic symptoms of
cognitive decline in the elderly. This pilot study was conducted
on a corpus composed of spontaneous speech samples collected
from 39 subjects, who underwent a neuropsychological screening
for visuo-spatial abilities, memory, language, executive functions
and attention. A rich set of linguistic features was extracted from
the digitalised utterances (at phonetic, suprasegmental, lexical,
morphological and syntactic levels) and the statistical significance
in pinpointing the pathological process was measured. Our results
show remarkable trends concerning both the selection of linguistic
traits and the building of automatic classifiers.
On the Use of a Serious Game for Recording a Speech Corpus of People with Intellectual Disabilities
Mario Corrales-Astorgano, David Escudero-Mancebo, Yurena Gutiérrez-González, Valle Flores-Lucas, César González-Ferreras and Valentín Cardeñoso-Payo
This paper describes the recording of a speech corpus focused
on prosody of people with intellectual disabilities. To do this,
a video game is used with the aim of improving the user’s
motivation. Moreover, the player’s profiles and the sentences
recorded during the game sessions are described. With the
purpose of identifying the main prosodic troubles of people with
intellectual disabilities, some prosodic features are extracted from
the recordings, such as fundamental frequency, energy and pauses. After
that, a comparison is made between the recordings of people with
intellectual disabilities and people without intellectual disabilities.
This comparison shows that pauses are the best discriminative
feature between these groups. To check this, a study has been
done using machine learning techniques, achieving a classification
rate above 80%.
Building Language Resources for ExploringAutism Spectrum Disorders
Julia Parish-Morris, Christopher Cieri, Mark Liberman, Leila Bateman, Emily Ferguson and Robert T. Schultz
Autism spectrum disorder (ASD) is a complex
neurodevelopmental condition that would benefit from low-cost
and reliable improvements to screening and diagnosis. Human
language technologies (HLTs) provide one possible route to
automating a series of subjective decisions that currently inform
“Gold Standard” diagnosis based on clinical judgment. In this
paper, we describe a new resource to support this goal, comprised
of 100 20-minute semi-structured English language samples
labeled with child age, sex, IQ, autism symptom severity, and
diagnostic classification. We assess the feasibility of digitizing
and processing sensitive clinical samples for data sharing, and
identify areas of difficulty. Using the methods described here, we
propose to join forces with researchers and clinicians throughout
the world to establish an international repository of annotated
language samples from individuals with ASD and related
disorders. This project has the potential to improve the lives of
individuals with ASD and their families by identifying linguistic
features that could improve remote screening, inform personalized
intervention, and promote advancements in clinically-oriented
HLTs.
Vocal Pathologies Detection and Mispronounced Phonemes Identification: Case of Arabic Continuous Speech
Naim Terbeh and Mounir Zrigui
We propose in this work a novel acoustic-phonetic study for
Arabic speakers suffering from language disabilities and non-native
learners of Arabic, to classify Arabic continuous speech as
pathological or healthy and to identify phonemes that
pose pronunciation problems (in the case of pathological speech).
The main idea can be summarized as comparing the reference
phonetic model of spoken Arabic with that of the speaker
concerned. For this task, we use techniques of
automatic speech processing like forced alignment and artificial
neural network (ANN) (Basheer, 2000). Based on a test corpus
containing 100 speech sequences, recorded by different speakers
(healthy/pathological speech and native/foreign speakers), we
attain a classification rate of 97%. The algorithms used to identify
phonemes that pose pronunciation problems show high efficiency:
we attain an identification rate of 100%.
P25 - Crowdsourcing
Thursday, May 26, 11:45
Chairperson: Monica Monachini Poster Session
Wikipedia Titles As Noun Tag Predictors
Armin Hoenen
In this paper, we investigate a covert labeling cue, namely the
probability that a title (taking the Wikipedia titles as an example) is
a noun. If this probability is very large, any list such as or
comparable to the Wikipedia titles can be used as a reliable
word-class (or part-of-speech tag) predictor or noun lexicon.
This may be especially useful in the case of Low Resource
Languages (LRL) where labeled data is lacking and putatively for
Natural Language Processing (NLP) tasks such as Word Sense
Disambiguation, Sentiment Analysis and Machine Translation.
Profiting from the ease of digital publication on the web as
opposed to print, LRL speaker communities produce resources
such as Wikipedia and Wiktionary, which can be used for an
assessment. We provide statistical evidence for a strong noun
bias for the Wikipedia titles from 2 corpora (English, Persian)
and a dictionary (Japanese) and for a typologically balanced set of
17 languages including LRLs. Additionally, we conduct a small
experiment on predicting noun tags for out-of-vocabulary items in
part-of-speech tagging for English.
Japanese Word–Color Associations with and without Contexts
Jun Harashima
Although some words carry strong associations with specific
colors (e.g., the word danger is associated with the color red), few
studies have investigated these relationships. This may be due
to the relative rarity of databases that contain large quantities of
such information. Additionally, these resources are often limited
to particular languages, such as English. Moreover, the existing
resources often do not consider the possible contexts of words
in assessing the associations between a word and a color. As a
result, the influence of context on word–color associations is not
fully understood. In this study, we constructed a novel language
resource for word–color associations. The resource has two
characteristics: First, our resource is the first to include Japanese
word–color associations, which were collected via crowdsourcing.
Second, the word–color associations in the resource are linked
to contexts. We show that word–color associations depend on
language and that associations with certain colors are affected by
context information.
The VU Sound Corpus: Adding More Fine-grained Annotations to the Freesound Database
Emiel van Miltenburg, Benjamin Timmermans and Lora Aroyo
This paper presents a collection of annotations (tags or keywords)
for a set of 2,133 environmental sounds taken from the Freesound
database (www.freesound.org). The annotations are acquired
through an open-ended crowd-labeling task, in which participants
were asked to provide keywords for each of three sounds. The
main goal of this study is to find out (i) whether it is feasible
to collect keywords for a large collection of sounds through
crowdsourcing, and (ii) how people talk about sounds, and what
information they can infer from hearing a sound in isolation. Our
main finding is that it is not only feasible to perform
crowd-labeling for a large collection of sounds, it is also very useful to
highlight different aspects of the sounds that authors may fail to
mention. Our data is freely available, and can be used to ground
semantic models, improve search in audio databases, and to study
the language of sound.
Crowdsourcing a Large Dataset of Domain-Specific Context-Sensitive Semantic Verb Relations
Maria Sukhareva, Judith Eckle-Kohler, Ivan Habernal and Iryna Gurevych
We present a new large dataset of 12403 context-sensitive verb
relations manually annotated via crowdsourcing. These relations
capture fine-grained semantic information between verb-centric
propositions, such as temporal or entailment relations. We
propose a novel semantic verb relation scheme and design a
multi-step annotation approach for scaling-up the annotations
using crowdsourcing. We employ several quality measures and
report on agreement scores. The resulting dataset is available
under a permissive Creative Commons license at
www.ukp.tu-darmstadt.de/data/verb-relations/. It represents a valuable
resource for various applications, such as automatic information
consolidation or automatic summarization.
Acquiring Opposition Relations among Italian Verb Senses using Crowdsourcing
Anna Feltracco, Simone Magnolini, Elisabetta Jezek and Bernardo Magnini
We describe an experiment for the acquisition of opposition
relations among Italian verb senses, based on a crowdsourcing
methodology. The goal of the experiment is to discuss whether
the types of opposition we distinguish (i.e. complementarity,
antonymy, converseness and reversiveness) are actually perceived
by the crowd. In particular, we collect data for Italian by using
the crowdsourcing platform CrowdFlower. We ask annotators to
judge the type of opposition existing among pairs of sentences
(previously judged as opposite) that differ only in a verb: the verb
in the first sentence is the opposite of the verb in the second sentence.
Data corroborate the hypothesis that some opposition relations
exclude each other, while others interact, being recognized as
compatible by the contributors.
Crowdsourcing a Multi-lingual Speech Corpus: Recording, Transcription and Annotation of the CrowdIS Corpora
Andrew Caines, Christian Bentz, Calbert Graham, Tim Polzehl and Paula Buttery
We announce the release of the CROWDED CORPUS: a pair
of speech corpora collected via crowdsourcing, containing a
native speaker corpus of English (CROWDED_ENGLISH) and
a corpus of German/English bilinguals (CROWDED_BILINGUAL).
Release 1 of the CROWDED CORPUS contains
1000 recordings amounting to 33,400 tokens collected from
80 speakers and is freely available to other researchers. We
recruited participants via the Crowdee application for Android.
Recruits were prompted to respond to business-topic questions
of the type found in language learning oral tests. We then
used the CrowdFlower web application to pass these recordings
to crowdworkers for transcription and annotation of errors and
sentence boundaries. Finally, the sentences were tagged and
parsed using standard natural language processing tools. We
propose that crowdsourcing is a valid and economical method for
corpus collection, and discuss the advantages and disadvantages
of this approach.
The REAL Corpus: A Crowd-Sourced Corpus of Human Generated and Evaluated Spatial References to Real-World Urban Scenes
Phil Bartie, William Mackaness, Dimitra Gkatzia and Verena Rieser
Our interest is in people’s capacity to efficiently and effectively
describe geographic objects in urban scenes. The broader
ambition is to develop spatial models capable of equivalent
functionality able to construct such referring expressions. To
that end we present a newly crowd-sourced data set of natural
language references to objects anchored in complex urban scenes
(In short: The REAL Corpus – Referring Expressions Anchored
Language). The REAL corpus contains a collection of images
of real-world urban scenes together with verbal descriptions of
target objects generated by humans, paired with data on how
successfully other people were able to identify the same object based
on these descriptions. In total, the corpus contains 32 images
with on average 27 descriptions per image and 3 verifications
for each description. In addition, the corpus is annotated with a
variety of linguistically motivated features. The paper highlights
issues posed by collecting data using crowd-sourcing with an
unrestricted input format, as well as using real-world urban
scenes.
Introducing the Weighted Trustability Evaluator for Crowdsourcing Exemplified by Speaker Likability Classification
Simone Hantke, Erik Marchi and Björn Schuller
Crowdsourcing is an emerging collaborative approach applicable,
among many other areas, to language and speech processing.
In fact, crowdsourcing has already been applied in the field of
speech processing with promising results. However, only a few
studies have investigated the use
of crowdsourcing in computational paralinguistics. In this
contribution, we propose a novel evaluator for crowdsourced
ratings, termed the Weighted Trustability Evaluator (WTE),
which is computed from the rater-dependent consistency over
the test questions. We further investigate the reliability of
crowdsourced annotations as compared to the ones obtained with
traditional labelling procedures, such as constrained listening
experiments in laboratories or in controlled environments.
This comparison includes an in-depth analysis of obtainable
classification performances. The experiments were conducted
on the Speaker Likability Database (SLD) already used in the
INTERSPEECH Challenge 2012, and the results lend further
weight to the assumption that crowdsourcing can be applied as a
reliable annotation source for computational paralinguistics given
a sufficient number of raters and suited measurements of their
reliability.
P26 - Emotion Recognition/Generation
Thursday, May 26, 11:45
Chairperson: Saif Mohammad Poster Session
Comparison of Emotional Understanding in Modality-Controlled Environments using Multimodal Online Emotional Communication Corpus
Yoshiko Arimoto and Kazuo Okanoya
In online computer-mediated communication, speakers were
considered to have experienced difficulties in catching their
partner’s emotions and in conveying their own emotions. To
explain why online emotional communication is so difficult and to
investigate how this problem should be solved, a multimodal online
emotional communication corpus was constructed by recording
approximately 100 speakers’ emotional expressions and reactions
in a modality-controlled environment. Speakers communicated
over the Internet using video chat, voice chat or text chat; their
face-to-face conversations were used for comparison purposes.
The corpora incorporated emotional labels by evaluating the
speaker’s dynamic emotional states and the measurements of
the speaker’s facial expression, vocal expression and autonomic
nervous system activity. For the initial study of this project,
which used a large-scale emotional communication corpus, the
accuracy of online emotional understanding was assessed to
demonstrate the emotional labels evaluated by the speakers
and to summarize the speaker’s answers on the questionnaire
regarding the difference between an online chat and face-to-
face conversations in which they actually participated. The
results revealed that speakers have difficulty communicating their
emotions in online communication environments, regardless of
the type of communication modality and that inaccurate emotional
75
understanding occurs more frequently in online computer-
mediated communication than in face-to-face communication.
Laughter in French Spontaneous Conversational Dialogs
Brigitte Bigi and Roxane Bertrand
This paper presents a quantitative description of laughter in eight
1-hour French spontaneous conversations. The paper includes the
raw figures for laughter as well as more detail concerning inter-
individual variability. It first describes to what extent the amount
and duration of laughter vary from speaker to speaker across all
dialogs. In a second set of analyses, the paper compares our corpus
with previously analyzed corpora. In a final set of experiments, it
presents some facts about overlapping laughs. The paper quantifies
all of these effects in free-style conversations for the first time.
AVAB-DBS: an Audio-Visual Affect Bursts Database for Synthesis
Kevin El Haddad, Huseyin Cakmak, Stéphane Dupont and Thierry Dutoit
It has been shown that adding expressivity and emotional
expressions to an agent’s communication systems would improve
the interaction quality between this agent and a human user. In
this paper we present a multimodal database of affect bursts,
which are very short non-verbal expressions with facial, vocal, and
gestural components that are highly synchronized and triggered by
an identifiable event. This database contains motion capture and
audio data of affect bursts representing disgust, startle and surprise
recorded at three different levels of arousal each. This database is
to be used for synthesis purposes in order to generate affect bursts
of these emotions on a continuous arousal level scale.
Construction of Japanese Audio-Visual Emotion Database and Its Application in Emotion Recognition
Nurul Lubis, Randy Gomez, Sakriani Sakti, Keisuke Nakamura, Koichiro Yoshino, Satoshi Nakamura and Kazuhiro Nakadai
Emotional aspects play a vital role in making human
communication a rich and dynamic experience. As we introduce
more automated systems into our daily lives, it becomes
increasingly important to incorporate emotion to provide as
natural an interaction as possible. To achieve this, rich sets
of labeled emotional data are a prerequisite. However, for
Japanese, existing emotion databases are still limited to unimodal
and bimodal corpora. Since emotion is expressed not only through
speech but also visually at the same time, it is essential to include
multiple modalities in an observation. In this paper, we present
the first audio-visual emotion corpus in Japanese, collected from
14 native speakers. The corpus contains 100 minutes of annotated
and transcribed material. We performed preliminary emotion
recognition experiments on the corpus and achieved an accuracy
of 61.42% for five classes of emotion.
Evaluating Context Selection Strategies to Build Emotive Vector Space Models
Lucia C. Passaro and Alessandro Lenci
In this paper we compare different context selection approaches
to improve the creation of Emotive Vector Space Models (VSMs).
The system builds on an existing approach that showed the
possibility of creating and updating VSMs by exploiting
crowdsourcing and human annotation. Here, we introduce a
method to manipulate the contexts of the VSMs under the
assumption that the emotive connotation of a target word is a
function of both its syntagmatic and paradigmatic association
with the various emotions. To study the differences among the
proposed spaces and to confirm the reliability of the system, we
report on two experiments: in the first one we validated the
best candidates extracted from each model, and in the second
one we compared the models’ performance on a random sample
of target words. Both experiments have been implemented as
crowdsourcing tasks.
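The association between a target word and an emotion that the abstract describes can be illustrated with a toy similarity computation. In this sketch (illustrative only; the vectors and seed words are invented, not the paper's actual model), a word's emotive score for an emotion is its cosine similarity to the centroid of that emotion's seed-word vectors:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def emotive_score(word_vec, emotion_seed_vecs):
    """Cosine similarity of a target word's vector to the centroid of
    an emotion's seed-word vectors (toy sketch of an emotive VSM)."""
    dim = len(word_vec)
    centroid = [sum(v[i] for v in emotion_seed_vecs) / len(emotion_seed_vecs)
                for i in range(dim)]
    return cosine(word_vec, centroid)

# Invented 2-d vectors: the target word aligns with the "joy" seeds.
joy_seeds = [[1.0, 0.1], [0.9, 0.2]]
score = emotive_score([1.0, 0.15], joy_seeds)
print(round(score, 2))
```

Ranking a vocabulary by such scores yields the per-emotion candidate lists that the paper validates via crowdsourcing.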
P27 - Machine Translation (2), Thursday, May 26, 11:45
Chairperson: Aljoscha Burchardt Poster Session
Simultaneous Sentence Boundary Detection and Alignment with Pivot-based Machine Translation Generated Lexicons
Antoine Bourlon, Chenhui Chu, Toshiaki Nakazawa and Sadao Kurohashi
Sentence alignment is a task that consists in aligning the parallel
sentences in a translated article pair. This paper describes a
method to perform sentence boundary detection and alignment
simultaneously, which significantly improves the alignment
accuracy on languages like Chinese with uncertain sentence
boundaries. It relies on the definition of hard (certain) and
soft (uncertain) punctuation delimiters, the latter being possibly
ignored to optimize the alignment result. The alignment method
is used in combination with lexicons automatically generated
from the input article pairs using pivot-based MT, achieving
better coverage of the input words with fewer entries than pre-
existing dictionaries. Pivot-based MT makes it possible to build
dictionaries for language pairs that have scarce parallel data. The
alignment method is implemented in a tool that will be freely
available in the near future.
That’ll Do Fine!: A Coarse Lexical Resource for English-Hindi MT, Using Polylingual Topic Models
Diptesh Kanojia, Aditya Joshi, Pushpak Bhattacharyya and Mark James Carman
Parallel corpora are often injected with bilingual lexical resources
for improved Indian language machine translation (MT). In the
absence of such lexical resources, multilingual topic models
have been used to create coarse lexical resources in the past,
using a Cartesian product approach. Our results show that
for morphologically rich languages like Hindi, the Cartesian
product approach is detrimental for MT. We then present a novel
‘sentential’ approach to use this coarse lexical resource from
a multilingual topic model. Our coarse lexical resource when
injected with a parallel corpus outperforms a system trained
using parallel corpus and a good quality lexical resource. As
demonstrated by the quality of our coarse lexical resource and
its benefit to MT, we believe that our sentential approach to
create such a resource will help MT for resource-constrained
languages.
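The Cartesian-product baseline the abstract argues against is easy to state concretely. This sketch (with invented word lists) pairs the top-k words of aligned topics across the two languages, which is exactly what over-generates candidate pairs for morphologically rich languages:

```python
from itertools import product

def cartesian_lexicon(topics_l1, topics_l2, k=3):
    """Pair the top-k words of aligned topics across two languages:
    the Cartesian-product approach the paper argues against for
    morphologically rich languages. Word lists are invented."""
    lexicon = set()
    for top1, top2 in zip(topics_l1, topics_l2):
        lexicon.update(product(top1[:k], top2[:k]))
    return lexicon

en_topics = [["rain", "water", "cloud"]]
hi_topics = [["baarish", "paani", "baadal"]]
lex = cartesian_lexicon(en_topics, hi_topics)
print(len(lex))  # 9 candidate pairs from a single aligned topic
```

Each topic contributes k² pairs, most of them spurious; the paper's 'sentential' alternative restricts which pairs are injected.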
ASPEC: Asian Scientific Paper Excerpt Corpus
Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi and Hitoshi Isahara
In this paper, we describe the details of the ASPEC (Asian
Scientific Paper Excerpt Corpus), the first large-scale parallel
corpus in the scientific paper domain. ASPEC
was constructed in the Japanese-Chinese machine translation
project conducted between 2006 and 2010 using the Special
Coordination Funds for Promoting Science and Technology.
It consists of a Japanese-English scientific paper abstract
corpus of approximately 3 million parallel sentences (ASPEC-
JE) and a Chinese-Japanese scientific paper excerpt corpus
of approximately 0.68 million parallel sentences (ASPEC-JC).
ASPEC is used as the official dataset for the machine translation
evaluation workshop WAT (Workshop on Asian Translation).
Domain Adaptation in MT Using Titles in Wikipedia as a Parallel Corpus: Resources and Evaluation
Gorka Labaka, Iñaki Alegria and Kepa Sarasola
This paper presents how a state-of-the-art SMT system is
enriched by using extra in-domain parallel corpora extracted
from Wikipedia. We collect corpora from parallel titles and from
parallel fragments in comparable articles from Wikipedia. We
carried out an evaluation with a double objective: evaluating
the quality of the extracted data and evaluating the improvement
due to the domain adaptation. We think this can be very useful
for languages with a limited amount of parallel corpora, where in-
domain data are crucial to improve the performance of MT systems.
The experiments on the Spanish-English language pair improve a
baseline trained with the Europarl corpus by more than 2 BLEU
points when translating in the Computer Science domain.
ProphetMT: A Tree-based SMT-driven Controlled Language Authoring/Post-Editing Tool
Xiaofeng Wu, Jinhua Du, Qun Liu and Andy Way
This paper presents ProphetMT, a tree-based SMT-driven
Controlled Language (CL) authoring and post-editing tool.
ProphetMT employs the source-side rules in a translation model
and provides them as auto-suggestions to users. Accordingly, one
might say that users are writing in a Controlled Language that
is understood by the computer. ProphetMT also allows users
to easily attach structural information as they compose content.
When a specific rule is selected, a partial translation is promptly
generated on-the-fly with the help of the structural information.
Our experiments conducted on English-to-Chinese show that
our proposed ProphetMT system can not only better regularise
an author’s writing behaviour, but also significantly improve
translation fluency which is vital to reduce the post-editing time.
Additionally, when the writing and translation process is over,
ProphetMT can provide an effective colour scheme to further
improve the productivity of post-editors by explicitly featuring the
relations between the source and target rules.
Towards producing bilingual lexica from monolingual corpora
Jingyi Han and Núria Bel
Bilingual lexica are the basis for many cross-lingual natural
language processing tasks. Recent work has shown success in
learning bilingual dictionaries by taking advantage of comparable
corpora and a diverse set of signals derived from monolingual
corpora. In the present work, we describe an approach to
automatically learn bilingual lexica by training a supervised
classifier using word embedding-based vectors of only a few
hundred translation equivalent word pairs. The word embedding
representations of translation pairs were obtained from source and
target monolingual corpora, which are not necessarily related.
Our classifier is able to predict whether a new word pair is
under a translation relation or not. We tested it on two quite
distinct language pairs, Chinese-Spanish and English-Spanish.
The classifiers achieved more than 0.90 precision and recall for
both language pairs in different evaluation scenarios. These results
show a high potential for this method to be used in bilingual lexica
production for language pairs with a reduced amount of parallel or
comparable corpora, in particular for phrase table expansion in
Statistical Machine Translation systems.
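The classification setup can be sketched as follows. The paper trains a supervised classifier on embedding-based vectors of translation pairs; this toy version (a perceptron over invented similarity features and 2-d embeddings, not the paper's actual features or classifier) shows the general shape:

```python
# Toy sketch: classify candidate word pairs as translations or not.
# Invented 2-d embeddings, hand-built pair features, and a simple
# perceptron stand in for the paper's actual setup.

def features(src_vec, tgt_vec):
    # Pair features: dot product (similarity) and L1 distance.
    dot = sum(a * b for a, b in zip(src_vec, tgt_vec))
    l1 = sum(abs(a - b) for a, b in zip(src_vec, tgt_vec))
    return [dot, l1]

def train_perceptron(xs, ys, epochs=20):
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):  # y is +1 (translation) or -1 (not)
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Translation pairs have similar toy vectors; negatives do not.
xs = [features([1.0, 0.0], [0.9, 0.1]), features([0.0, 1.0], [0.1, 0.9]),
      features([1.0, 0.0], [0.1, 0.9]), features([0.0, 1.0], [0.9, 0.1])]
ys = [1, 1, -1, -1]
w, b = train_perceptron(xs, ys)
pred = predict(w, b, features([1.0, 0.1], [0.8, 0.2]))
print(pred)  # 1: the unseen pair is classified as a translation
```

The key point the abstract makes is that only a few hundred seed pairs suffice to train such a classifier, since the features generalize across the vocabulary.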
First Steps Towards Coverage-Based Sentence Alignment
Luís Gomes and Gabriel Pereira Lopes
In this paper, we introduce a coverage-based scoring function that
discriminates between parallel and non-parallel sentences. When
plugged into Bleualign, a state-of-the-art sentence aligner, our
function improves both precision and recall of alignments over the
originally proposed BLEU score. Furthermore, since our scoring
function uses Moses phrase tables directly we avoid the need to
translate the texts to be aligned, which is time-consuming and a
potential source of alignment errors.
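A coverage-style scoring function of the kind described can be sketched as follows (an illustrative formulation, not the exact function from the paper): score a sentence pair by how many tokens on each side have a phrase-table translation on the other side.

```python
def coverage_score(src_tokens, tgt_tokens, phrase_table):
    """Harmonic mean of the fraction of source tokens with a
    phrase-table translation present in the target, and vice versa.
    An illustrative formulation, not the paper's exact function."""
    src_set, tgt_set = set(src_tokens), set(tgt_tokens)
    src_cov = sum(1 for s in src_tokens
                  if phrase_table.get(s, set()) & tgt_set) / len(src_tokens)
    # Invert the table to check target-side coverage.
    inv = {}
    for s, translations in phrase_table.items():
        for t in translations:
            inv.setdefault(t, set()).add(s)
    tgt_cov = sum(1 for t in tgt_tokens
                  if inv.get(t, set()) & src_set) / len(tgt_tokens)
    if src_cov + tgt_cov == 0:
        return 0.0
    return 2 * src_cov * tgt_cov / (src_cov + tgt_cov)

# Toy phrase table (invented entries).
pt = {"casa": {"house", "home"}, "blanca": {"white"}}
score = coverage_score(["casa", "blanca"], ["white", "house"], pt)
print(score)  # 1.0: every token on each side is covered
```

Because the score is computed directly from phrase-table lookups, no intermediate translation of the texts is needed, which is the efficiency argument the abstract makes.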
Using the TED Talks to Evaluate Spoken Post-editing of Machine Translation
Jeevanthi Liyanapathirana and Andrei Popescu-Belis
This paper presents a solution to evaluate spoken post-editing
of imperfect machine translation output by a human translator.
We compare two approaches to the combination of machine
translation (MT) and automatic speech recognition (ASR): a
heuristic algorithm and a machine learning method. To obtain a
data set with spoken post-editing information, we use the French
version of TED talks as the source texts submitted to MT, and
the spoken English counterparts as their corrections, which are
submitted to an ASR system. We experiment with various levels
of artificial ASR noise and also with a state-of-the-art ASR
system. The results show that the combination of MT with ASR
improves over both individual outputs of MT and ASR in terms of
BLEU scores, especially when ASR performance is low.
Phrase Level Segmentation and Labelling of Machine Translation Errors
Frédéric Blain, Varvara Logacheva and Lucia Specia
This paper presents our work towards a novel approach for Quality
Estimation (QE) of machine translation based on sequences of
adjacent words, the so-called phrases. This new level of QE
aims to provide a natural balance between word-level and
sentence-level QE, which are either too fine-grained or too coarse
for some applications. However, phrase-level QE poses an
intrinsic challenge: how to segment a machine translation into
sequences of words (contiguous or not) that represent an error.
We discuss three possible segmentation strategies to automatically
extract erroneous phrases. We evaluate these strategies against
annotations at phrase-level produced by humans, using a new
dataset collected for this purpose.
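One simple segmentation strategy is to merge adjacent words carrying the same word-level error tag into a single phrase-level span. The sketch below illustrates this idea (it is not necessarily one of the three strategies the paper evaluates):

```python
def segment_errors(word_tags):
    """Merge adjacent words sharing an error tag into phrase spans
    (start, end, tag), end inclusive. One plausible strategy; not
    necessarily one of the three the paper evaluates."""
    spans, start = [], 0
    for i in range(1, len(word_tags) + 1):
        if i == len(word_tags) or word_tags[i] != word_tags[start]:
            spans.append((start, i - 1, word_tags[start]))
            start = i
    return spans

tags = ["OK", "BAD", "BAD", "OK", "BAD"]
spans = segment_errors(tags)
print(spans)  # [(0, 0, 'OK'), (1, 2, 'BAD'), (3, 3, 'OK'), (4, 4, 'BAD')]
```

Non-contiguous error phrases, which the abstract mentions, require a more elaborate grouping than this adjacency-based one.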
SubCo: A Learner Translation Corpus of Human and Machine Subtitles
José Manuel Martínez Martínez and Mihaela Vela
In this paper, we present a freely available corpus of human and
automatic translations of subtitles. The corpus comprises the
original English subtitles (SRC), both human (HT) and machine
translations (MT) into German, as well as post-editions (PE) of
the MT output. HT and MT are annotated with errors. Moreover,
human evaluation is included in HT, MT, and PE. Such a corpus
is a valuable resource for both human and machine translation
communities, enabling the direct comparison – in terms of errors
and evaluation – between human and machine translations and
post-edited machine translations.
P28 - Multiword Expressions, Thursday, May 26, 11:45
Chairperson: Irina Temnikova Poster Session
Towards Lexical Encoding of Multi-Word Expressions in Spanish Dialects
Diana Bogantes, Eric Rodríguez, Alejandro Arauco, Alejandro Rodríguez and Agata Savary
This paper describes a pilot study in lexical encoding of multi-
word expressions (MWEs) in 4 Latin American dialects of
Spanish: Costa Rican, Colombian, Mexican and Peruvian. We
describe the variability of MWE usage across dialects. We adapt
an existing data model to a dialect-aware encoding, so as to
represent dialect-related specificities, while avoiding redundancy
of the data common to all dialects. A dozen linguistic
properties of MWEs can be expressed in this model, both at
the level of a whole MWE and of its individual components.
We describe the resulting lexical resource containing several
dozen MWEs in four dialects and we propose a method
for constructing a web corpus as a support for crowdsourcing
examples of MWE occurrences. The resource is available under
an open license and paves the way towards a large-scale dialect-
aware language resource construction, which should prove useful
in both traditional and novel NLP applications.
JATE 2.0: Java Automatic Term Extraction with Apache Solr
Ziqi Zhang, Jie Gao and Fabio Ciravegna
Automatic Term Extraction (ATE) or Recognition (ATR) is a
fundamental processing step preceding many complex knowledge
engineering tasks. However, few methods have been implemented
as public tools and in particular, available as open-source
freeware. Further, little effort is made to develop an adaptable
and scalable framework that enables customization, development,
and comparison of algorithms under a uniform environment.
This paper introduces JATE 2.0, a complete remake of the free
Java Automatic Term Extraction Toolkit (Zhang et al., 2008)
delivering new features including: (1) highly modular, adaptable
and scalable ATE thanks to integration with Apache Solr, the open
source free-text indexing and search platform; (2) an extended
collection of state-of-the-art algorithms. We carry out experiments
on two well-known benchmarking datasets and compare the
algorithms along the dimensions of effectiveness (precision) and
efficiency (speed and memory consumption). To the best of our
knowledge, this is by far the only free ATE library offering a
flexible architecture and the most comprehensive collection of
algorithms.
A lexicon of perception for the identification of synaesthetic metaphors in corpora
Francesca Strik Lievers and Chu-Ren Huang
Synaesthesia is a type of metaphor associating linguistic
expressions that refer to two different sensory modalities.
Previous studies, based on the analysis of poetic texts, have
shown that synaesthetic transfers tend to go from the lower toward
the higher senses (e.g., sweet music vs. musical sweetness).
In non-literary language synaesthesia is rare, and finding a
sufficient number of examples manually would be too time-
consuming. In order to verify whether the directionality also
holds for conventional synaesthesia found in non-literary texts,
an automatic procedure for the identification of instances of
synaesthesia is therefore highly desirable. In this paper, we first
focus on the preliminary step of this procedure, that is, the creation
of a controlled lexicon of perception. Next, we present the results
of a small pilot study that applies the extraction procedure to
English and Italian corpus data.
TermoPL - a Flexible Tool for Terminology Extraction
Malgorzata Marciniak, Agnieszka Mykowiecka and Piotr Rychlik
The purpose of this paper is to introduce the TermoPL tool created
to extract terminology from domain corpora in Polish. The
program extracts noun phrases (term candidates) with the help
of a simple grammar that can be adapted to the user's needs. It
applies the C-value method to rank term candidates being either
the longest identified nominal phrases or their nested subphrases.
The method operates on simplified base forms in order to unify
morphological variants of terms and to recognize their contexts.
We support the recognition of nested terms by word connection
strength which allows us to eliminate truncated phrases from the
top part of the term list. The program has an option to convert
simplified forms of phrases into correct phrases in the nominal
case. TermoPL accepts as input morphologically annotated and
disambiguated domain texts and creates a list of terms, the top
part of which comprises domain terminology. It can also compare
two candidate term lists using three different coefficients that show
the asymmetry of term occurrences in the data.
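The C-value ranking the tool relies on is a standard measure (Frantzi et al.): a candidate's score grows with its length and frequency, discounted by the mean frequency of longer candidates that contain it. A minimal sketch with invented frequencies:

```python
import math

def c_value(candidates):
    """C-value term scores (Frantzi et al.): log2 of the candidate's
    length times its frequency, discounted by the mean frequency of
    longer candidates that contain it."""
    scores = {}
    for cand, freq in candidates.items():
        # Longer candidates containing `cand` as a contiguous subsequence.
        containers = [f for other, f in candidates.items()
                      if len(other) > len(cand)
                      and any(other[i:i + len(cand)] == cand
                              for i in range(len(other) - len(cand) + 1))]
        adjusted = freq - sum(containers) / len(containers) if containers else freq
        scores[cand] = math.log2(len(cand)) * adjusted
    return scores

# Invented frequencies: the nested bigram is discounted by the trigram.
cands = {("basal", "cell", "carcinoma"): 5, ("cell", "carcinoma"): 8}
s = c_value(cands)
print(s[("cell", "carcinoma")])  # 1.0 * (8 - 5) = 3.0
```

The discounting is what lets the method keep nested subphrases that occur independently while demoting those that only ever appear inside longer terms.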
GhoSt-NN: A Representative Gold Standard of German Noun-Noun Compounds
Sabine Schulte im Walde, Anna Hätty, Stefan Bott and Nana Khvtisavrishvili
This paper presents a novel gold standard of German
noun-noun compounds (Ghost-NN) including 868 compounds
annotated with corpus frequencies of the compounds and their
constituents, productivity and ambiguity of the constituents,
semantic relations between the constituents, and compositionality
ratings of compound-constituent pairs. Moreover, a subset
of 180 compounds is balanced for the productivity of the
modifiers (distinguishing low/mid/high productivity) and the
ambiguity of the heads (distinguishing between heads with 1, 2
and >2 senses).
DeQue: A Lexicon of Complex Prepositions and Conjunctions in French
Carlos Ramisch, Alexis Nasr, André Valli and José Deulofeu
We introduce DeQue, a lexicon covering French complex
prepositions (CPRE) like “à partir de” (from) and complex
conjunctions (CCONJ) like “bien que” (although). The lexicon
includes fine-grained linguistic description based on empirical
evidence. We describe the general characteristics of CPRE and
CCONJ in French, with special focus on syntactic ambiguity.
Then, we list the selection criteria used to build the lexicon and the
corpus-based methodology employed to collect entries. Finally,
we quantify the ambiguity of each construction by annotating
around 100 sentences randomly taken from the FRWaC. In
addition to its theoretical value, the resource has many potential
practical applications. We intend to employ DeQue for treebank
annotation and to train a dependency parser that takes complex
constructions into account.
PARSEME Survey on MWE Resources
Gyri Smørdal Losnegaard, Federico Sangati, Carla Parra Escartín, Agata Savary, Sascha Bargmann and Johanna Monti
This paper summarizes the preliminary results of an ongoing
survey on multiword resources carried out within the IC1207
Cost Action PARSEME (PARSing and Multi-word Expressions).
Despite the availability of language resource catalogs and the
inventory of multiword datasets on the SIGLEX-MWE website,
multiword resources are scattered and difficult to find. In many
cases, language resources such as corpora, treebanks, or lexical
databases include multiwords as part of their data or take them
into account in their annotations. However, these resources need
to be centralized to make them accessible. The aim of this
survey is to create a portal where researchers can easily find
multiword(-aware) language resources for their research. We
report on the design of the survey and analyze the data gathered
so far. We also discuss the problems we have detected upon
examination of the data as well as possible ways of enhancing
the survey.
Multiword Expressions in Child Language
Rodrigo Wilkens, Marco Idiart and Aline Villavicencio
The goal of this work is to introduce CHILDES-MWE, which
contains English CHILDES corpora automatically annotated with
Multiword Expressions (MWEs) information. The result is a
resource with almost 350,000 sentences annotated with more than
70,000 distinct MWEs of various types from both longitudinal
and latitudinal corpora. This resource can be used for large
scale language acquisition studies of how MWEs feature in child
language. Focusing on compound nouns (CN), we then verify
in a longitudinal study if there are differences in the distribution
and compositionality of CNs in child-directed and child-produced
sentences across ages. Moreover, using additional latitudinal data,
we investigate if there are further differences in CN usage and in
compositionality preferences. The results obtained for the child-
produced sentences reflect CN distribution and compositionality
in child-directed sentences.
Transfer-Based Learning-to-Rank Assessment of Medical Term Technicality
Dhouha Bouamor, Leonardo Campillos Llanos, Anne-Laure Ligozat, Sophie Rosset and Pierre Zweigenbaum
While measuring the readability of texts has been a long-standing
research topic, assessing the technicality of terms has only been
addressed more recently and mostly for the English language.
In this paper, we train a learning-to-rank model to determine a
specialization degree for each term found in a given list. Since
no training data for this task exist for French, we train our system
with non-lexical features on English data, namely, the Consumer
Health Vocabulary, then apply it to French. The features
include the likelihood ratio of the term based on specialized and
lay language models, and tests for containing morphologically
complex words. The evaluation of this approach is conducted on
134 terms from the UMLS Metathesaurus and 868 terms from the
Eugloss thesaurus. The Normalized Discounted Cumulative Gain
obtained by our system is over 0.8 on both test sets. Moreover,
thanks to the learning-to-rank approach, adding morphological
features to the language model features improves the results on
the Eugloss thesaurus.
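The evaluation metric, Normalized Discounted Cumulative Gain, can be computed as follows (standard formulation; the relevance grades below are invented):

```python
import math

def dcg(relevances):
    # Discounted cumulative gain of a ranked list of relevance grades.
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """NDCG: the system ranking's DCG divided by the DCG of the ideal
    (descending-relevance) ordering. Standard formulation."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# Invented relevance grades (gold technicality of the term at each rank).
value = ndcg([3, 2, 3, 0, 1])
print(round(value, 3))  # 0.972
```

An NDCG above 0.8, as reported, thus means the predicted technicality ordering is close to the gold ordering.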
Example-based Acquisition of Fine-grained Collocation Resources
Sara Rodríguez-Fernández, Roberto Carlini, Luis Espinosa Anke and Leo Wanner
Collocations such as “heavy rain” or “make [a] decision”, are
combinations of two elements where one (the base) is freely
chosen, while the choice of the other (collocate) is restricted,
depending on the base. Collocations present difficulties even
to advanced language learners, who usually struggle to find the
right collocate to express a particular meaning, e.g., both “heavy”
and “strong” express the meaning ’intense’, but while “rain”
selects “heavy”, “wind” selects “strong”. Lexical Functions
(LFs) describe the meanings that hold between the elements of
collocations, such as ’intense’, ’perform’, ’create’, ’increase’,
etc. Language resources with semantically classified collocations
would be of great help for students; however, they are scarce and
expensive to build, since they are manually constructed. We
present an unsupervised approach to the acquisition and semantic
classification of collocations according to LFs, based on word
embeddings in which, given an example of a collocation for each
of the target LFs and a set of bases, the system retrieves a list of
collocates for each base and LF.
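The retrieval step can be illustrated with a toy analogy-style computation over word embeddings (invented 2-d vectors; the paper's actual method may differ): the offset from an example collocation's base to its collocate is applied to a new base, and the nearest candidate is returned.

```python
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def retrieve_collocate(example_base, example_coll, base, emb, candidates):
    """Apply the offset from an example collocation (base -> collocate)
    to a new base and return the closest candidate collocate."""
    offset = [c - b for b, c in zip(emb[example_base], emb[example_coll])]
    target = [b + o for b, o in zip(emb[base], offset)]
    return max(candidates, key=lambda w: cos(target, emb[w]))

# Invented 2-d embeddings placing 'strong' where 'wind' plus the
# ('rain' -> 'heavy') offset lands.
emb = {"rain": [1.0, 0.0], "heavy": [1.0, 1.0],
       "wind": [0.0, 1.0], "strong": [0.0, 2.0], "fast": [5.0, 0.1]}
result = retrieve_collocate("rain", "heavy", "wind", emb,
                            ["heavy", "strong", "fast"])
print(result)  # strong
```

This mirrors the abstract's example: "heavy" and "strong" both realize the LF 'intense', but the base determines which collocate is selected.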
MWEs in Treebanks: From Survey to Guidelines
Victoria Rosén, Koenraad De Smedt, Gyri Smørdal Losnegaard, Eduard Bejcek, Agata Savary and Petya Osenova
By means of an online survey, we have investigated ways in
which various types of multiword expressions are annotated in
existing treebanks. The results indicate that there is considerable
variation in treatments across treebanks and thereby also, to
some extent, across languages and across theoretical frameworks.
The comparison is focused on the annotation of light verb
constructions and verbal idioms. The survey shows that light
verb constructions either get special annotations as such or are
treated as ordinary verbs, while VP idioms are handled through
different strategies. Based on insights from our investigation,
we propose some general guidelines for annotating multiword
expressions in treebanks. The recommendations address the
following application-based needs: distinguishing MWEs from
similar but compositional constructions; searching distinct types
of MWEs in treebanks; awareness of literal and nonliteral
meanings; and normalization of the MWE representation. The
cross-lingually and cross-theoretically focused survey is intended
as an aid to accessing treebanks and an aid for further work on
treebank annotation.
Multiword Expressions Dataset for Indian Languages
Dhirendra Singh, Sudha Bhingardive and Pushpak Bhattacharya
Multiword Expressions (MWEs) are used frequently in natural
languages, but understanding their diversity is one of the open
problems in Natural Language Processing. In the context of Indian
languages, MWEs play an important role. In this paper, we present
an MWE annotation dataset created for two Indian languages,
viz. Hindi and Marathi. We extract possible MWE candidates
using two repositories: 1) a POS-tagged corpus and 2) the
IndoWordNet synsets. Annotation is done for two types of MWEs:
compound nouns and light verb constructions. In the annotation
process, human annotators tag valid MWEs among these
candidates based on the standard guidelines provided to them.
Using the two repositories, we obtained 3178 compound nouns and
2556 light verb constructions for Hindi, and 1003 compound nouns
and 2416 light verb constructions for Marathi. The created
resource is made publicly available and can be used as a gold
standard for Hindi and Marathi MWE systems.
P29 - Treebanks (2), Thursday, May 26, 11:45
Chairperson: Claire Bonial Poster Session
MarsaGram: an excursion in the forests of parsing trees
Philippe Blache, Stéphane Rauzy and Grégoire Montcheuil
The question of how to compare languages, and more generally
the domain of linguistic typology, relies on the study of
different linguistic properties or phenomena. Classically, such
a comparison is done semi-manually, for example by extracting
information from databases such as the WALS. However, it
remains difficult to identify precise, regular parameters, available
for different languages, that can be used as a basis for
modeling. Focusing on the question of syntactic typology, we
propose in this paper a method for automatically extracting
such parameters from treebanks, bringing them into a typological
perspective. We present the method and the tools for inferring
this information and navigating through the treebanks. The
approach has been applied to 10 languages of the Universal
Dependencies treebanks and is evaluated by showing how the
automatic classification correlates with language families.
EasyTree: A Graphical Tool for Dependency Tree Annotation
Alexa Little and Stephen Tratz
This paper introduces EasyTree, a dynamic graphical tool for
dependency tree annotation. Built in JavaScript using the popular
D3 data visualization library, EasyTree allows annotators to
construct and label trees entirely by manipulating graphics, and
then export the corresponding data in JSON format. Human users
are thus able to annotate in an intuitive way without compromising
the machine-compatibility of the output. EasyTree has a number
of features to assist annotators, including color-coded part-of-
speech indicators and optional translation displays. It can also
be customized to suit a wide range of projects; part-of-speech
categories, edge labels, and many other settings can be edited
from within the GUI. The system also utilizes UTF-8 encoding
and properly handles both left-to-right and right-to-left scripts.
By providing a user-friendly annotation tool, we aim to reduce
time spent transforming data or learning to use the software,
to improve the user experience for annotators, and to make
annotation approachable even for inexperienced users. Unlike
existing solutions, EasyTree is built entirely with standard web
technologies (JavaScript, HTML, and CSS), making it ideal for
web-based annotation efforts, including crowdsourcing.
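A JSON export of a dependency tree, of the kind EasyTree produces, might look like the following sketch (field names are illustrative, not EasyTree's actual schema):

```python
import json

# Invented field names for illustration; not EasyTree's actual schema.
tree = [
    {"id": 1, "form": "EasyTree", "pos": "NOUN", "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "simplifies", "pos": "VERB", "head": 0, "deprel": "root"},
    {"id": 3, "form": "annotation", "pos": "NOUN", "head": 2, "deprel": "obj"},
]

# Round-trip through JSON, as a browser-based tool would export/import.
serialized = json.dumps(tree, ensure_ascii=False)
restored = json.loads(serialized)
print(restored[1]["deprel"])  # root
```

`ensure_ascii=False` keeps non-Latin scripts readable in the output, which matters for the UTF-8 and right-to-left support the abstract highlights.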
Hypergraph Modelization of a Syntactically Annotated English Wikipedia Dump
Edmundo Pavel Soriano Morales, Julien Ah-Pine and Sabine Loudcher
Wikipedia, the well known internet encyclopedia, is nowadays
a widely used source of information. To leverage its rich
information, already parsed versions of Wikipedia have been
proposed. We present an annotated dump of the English
Wikipedia. This dump draws upon previously released Wikipedia
parsed dumps. Still, we head in a different direction. In
this parse we focus more into the syntactical characteristics of
words: aside from the classical Part-of-Speech (PoS) tags and
dependency parsing relations, we provide the full constituent
parse branch for each word in a succinct way. Additionally,
we propose a hypergraph network representation of the extracted
linguistic information. The proposed modelization aims to take
advantage of the information stored within our parsed Wikipedia
dump. We hope that by releasing these resources, researchers
from the concerned communities will have a ready-to-experiment
Wikipedia corpus with which to compare and distribute their work.
We make public our parsed Wikipedia dump as well as the tool
(and its source code) used to perform the parse. The hypergraph
network and its related metadata are also distributed.
Detecting Annotation Scheme Variation in Out-of-Domain Treebanks
Yannick Versley and Julius Steen
To ensure portability of NLP systems across multiple domains,
existing treebanks are often extended by adding trees from
interesting domains that were not part of the initial annotation
effort. In this paper, we will argue that it is both useful
from an application viewpoint and enlightening from a linguistic
viewpoint to detect and reduce divergence in annotation schemes
between extant and new parts in a set of treebanks that is to be
used in evaluation experiments. The results of our correction and
harmonization efforts will be made available to the public as a test
suite for the evaluation of constituent parsing.
Universal Dependencies for Persian
Mojgan Seraji, Filip Ginter and Joakim Nivre
The Persian Universal Dependency Treebank (Persian UD) is a
recent effort of treebanking Persian with Universal Dependencies
(UD), an ongoing project that designs unified and cross-
linguistically valid grammatical representations including part-of-
speech tags, morphological features, and dependency relations.
The Persian UD is the converted version of the Uppsala Persian
Dependency Treebank (UPDT) to the universal dependencies
framework and consists of nearly 6,000 sentences and 152,871
word tokens with an average sentence length of 25 words.
In addition to the syntactic annotation guidelines, the two
treebanks differ in tokenization. All words containing
unsegmented clitics (pronominal and copula clitics), annotated
with complex labels in the UPDT, have been separated from the
clitics and appear with distinct labels in the Persian UD.
The UPDT has its own syntactic annotation scheme, based
on Stanford Typed Dependencies. In this paper, we present the
approaches taken in the development of the Persian UD.
Hard Time Parsing Questions: Building a QuestionBank for French
Djamé Seddah and Marie Candito
We present the French Question Bank, a treebank of 2,600
questions. We show that the performance of classical parsing
models drops when facing out-of-domain data with strong
structural divergences, while the inclusion of this data set is
highly beneficial without harming the parsing of non-question
data. With two thirds of the questions aligned with the
QuestionBank (Judge et al., 2006) and being freely available,
this treebank will prove useful for building robust NLP systems.
Enhanced English Universal Dependencies: An Improved Representation for Natural Language Understanding Tasks
Sebastian Schuster and Christopher D. Manning
Many shallow natural language understanding tasks use
dependency trees to extract relations between content words.
However, strict surface-structure dependency trees tend to follow
the linguistic structure of sentences too closely and frequently fail
to provide direct relations between content words. To mitigate
this problem, the original Stanford Dependencies representation
also defines two dependency graph representations which contain
additional and augmented relations that explicitly capture
otherwise implicit relations between content words. In this paper,
we revisit and extend these dependency graph representations in
light of the recent Universal Dependencies (UD) initiative and
provide a detailed account of an enhanced and an enhanced++
English UD representation. We further present a converter from
constituency to basic, i.e., strict surface structure, UD trees, and
a converter from basic UD trees to enhanced and enhanced++
English UD graphs. We release both converters as part of Stanford
CoreNLP and the Stanford Parser.
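To illustrate the kind of augmentation the enhanced representation performs, consider a toy sketch (this is not the Stanford CoreNLP converter itself, only an invented illustration of one enhancement): in "Paul and Mary ran", the basic tree attaches only "Paul" as subject of "ran", while an enhanced graph also adds a subject edge to the conjunct "Mary".

```python
# Toy sketch of one enhanced-UD augmentation: propagating a subject
# relation onto conjoined dependents. Invented for illustration; not
# the actual Stanford CoreNLP conversion code.

# A dependency edge is (head, relation, dependent); "Paul and Mary ran"
basic_edges = [
    ("ran", "nsubj", "Paul"),
    ("Paul", "conj:and", "Mary"),
    ("Paul", "cc", "and"),
]

def enhance(edges):
    """Copy the nsubj relation onto conjuncts of the subject."""
    enhanced = list(edges)
    for head, rel, dep in edges:
        if rel == "nsubj":
            for h2, r2, d2 in edges:
                if h2 == dep and r2.startswith("conj"):
                    # add the implicit relation: ran -nsubj-> Mary
                    enhanced.append((head, rel, d2))
    return enhanced

print(enhance(basic_edges))
```

The same propagation pattern is what makes relations between content words directly reachable for downstream relation-extraction systems.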
A Proposition Bank of Urdu
Maaz Anwar, Riyaz Ahmad Bhat, Dipti Sharma, Ashwini Vaidya, Martha Palmer and Tafseer Ahmed Khan
This paper describes our efforts for the development of a
Proposition Bank for Urdu, an Indo-Aryan language. Our primary
goal is the labeling of syntactic nodes in the existing Urdu
dependency Treebank with specific argument labels. In essence,
it involves annotation of predicate argument structures of both
simple and complex predicates in the Treebank corpus. We
describe the overall process of building the PropBank of Urdu.
We discuss various statistics pertaining to the Urdu PropBank and
the issues which the annotators encountered while developing the
PropBank. We also discuss how these challenges were addressed
to successfully expand the PropBank corpus. While reporting
the inter-annotator agreement between the two annotators, we
show that they share a similar understanding of the
annotation guidelines and of the linguistic phenomena present in
the language. The present size of this PropBank is around 180,000
tokens, which have been double-propbanked by the two annotators for
simple predicates. Another 100,000 tokens have been annotated
for complex predicates of Urdu.
Czech Legal Text Treebank 1.0
Vincent Kríž, Barbora Hladka and Zdenka Uresova
We introduce a new member of the family of Prague
dependency treebanks. The Czech Legal Text Treebank 1.0 is
a morphologically and syntactically annotated corpus of 1,128
sentences. The treebank contains texts from the legal domain,
namely the documents from the Collection of Laws of the Czech
Republic. Legal texts differ from those of other domains in several
language phenomena, influenced by the rather high frequency of very
long sentences. The manual annotation of such sentences presents
a new challenge. We describe a strategy and tools for this task.
The resulting treebank can be explored in various ways. It can be
downloaded from the LINDAT/CLARIN repository and viewed
locally using the TrEd editor or it can be accessed on-line using
the KonText and TreeQuery tools.
P30 - Linked Data
Thursday, May 26, 14:55
Chairperson: Felix Sasaki Poster Session
Concepticon: A Resource for the Linking of Concept Lists
Johann-Mattis List, Michael Cysouw and Robert Forkel
We present an attempt to link the large amount of different concept
lists which are used in the linguistic literature, ranging from
Swadesh lists in historical linguistics to naming tests in clinical
studies and psycholinguistics. This resource, our Concepticon,
links 30,222 concept labels from 160 concept lists to 2,495 concept
sets. Each concept set is given a unique identifier, a unique
label, and a human-readable definition. Concept sets are further
structured by defining different relations between the concepts.
The resource can be used for various purposes. Serving as
a rich reference for new and existing databases in diachronic
and synchronic linguistics, it gives researchers quick access
to studies on semantic change, cross-linguistic polysemies, and
semantic associations.
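The linking idea can be shown with a minimal sketch (the labels, identifiers and glosses below are invented for illustration and are not actual Concepticon data): list-specific labels from different elicitation lists resolve to one shared concept set with a unique identifier and label.

```python
# Toy illustration of linking concept-list labels to concept sets.
# All data is invented; the real resource links 30,222 labels from
# 160 concept lists to 2,495 concept sets.

# Each concept set: unique identifier -> (unique label, definition)
concept_sets = {
    1234: ("HAND", "The body part at the end of the arm, used for grasping."),
}

# Links from (concept list, list-specific label) to concept-set identifiers
links = {
    ("Swadesh-1952-200", "hand"): 1234,
    ("NamingTest-X", "the hand"): 1234,
}

def lookup(conceptlist: str, label: str):
    """Resolve a list-specific label to its concept set, if linked."""
    cs_id = links.get((conceptlist, label))
    if cs_id is None:
        return None
    label_norm, _definition = concept_sets[cs_id]
    return cs_id, label_norm

# Both list entries resolve to the same concept set
print(lookup("Swadesh-1952-200", "hand"))
print(lookup("NamingTest-X", "the hand"))
```

This shared-identifier design is what lets heterogeneous concept lists be compared against each other.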
LVF-lemon – Towards a Linked Data Representation of “Les Verbes français”
Ingrid Falk and Achim Stein
In this study we elaborate a road map for the conversion of a
traditional lexical syntactico-semantic resource for French into
a linguistic linked open data (LLOD) model. Our approach
uses current best-practices and the analyses of earlier similar
undertakings (lemonUBY and PDEV-lemon) to tease out the most
appropriate representation for our resource.
Riddle Generation using Word Associations
Paloma Galvan, Virginia Francisco, Raquel Hervas and Gonzalo Mendez
In knowledge bases where concepts have associated properties,
there is a large amount of comparative information that is
implicitly encoded in the values of the properties these concepts
share. Although there have been previous approaches to
generating riddles, none of them seem to take advantage
of structured information stored in knowledge bases such as
Thesaurus Rex, which organizes concepts according to the fine
grained ad-hoc categories they are placed into by speakers in
everyday language, along with associated properties or modifiers.
Taking advantage of these shared properties, we have developed
a riddle generator that creates riddles about concepts represented
as common nouns. These riddles are based on comparisons
between the target concept and other entities that share some of
its properties. In this paper, we describe the process we have
followed to generate the riddles starting from the target concept
and we show the results of the first evaluation we have carried out
to test the quality of the resulting riddles.
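A skeletal version of this comparison-based strategy might look as follows (the concept-property data and phrasing template are invented, not drawn from Thesaurus Rex or the authors' system):

```python
# Minimal comparison-based riddle sketch with invented data.
# Each concept maps to a set of associated properties.
properties = {
    "tomato": {"red", "round", "edible"},
    "ball": {"round", "bouncy"},
    "cherry": {"red", "round", "edible", "small"},
}

def riddle(target: str) -> str:
    """Build a riddle comparing the target to entities sharing a property."""
    clues = []
    for other, props in properties.items():
        if other == target:
            continue
        shared = properties[target] & props
        if shared:
            # pick one shared property deterministically for the clue
            clues.append(f"{sorted(shared)[0]} like a {other}")
    return "What is " + " and ".join(clues) + "?"

print(riddle("tomato"))  # -> What is round like a ball and edible like a cherry?
```

The real system draws such shared properties from a large knowledge base rather than a hand-written dictionary.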
Challenges of Adjective Mapping between plWordNet and Princeton WordNet
Ewa Rudnicka, Wojciech Witkowski and Katarzyna Podlaska
The paper presents the strategy and results of mapping adjective
synsets between plWordNet (the wordnet of Polish, cf. Piasecki
et al. 2009, Maziarz et al. 2013) and Princeton WordNet
(cf. Fellbaum 1998). The main challenge of this enterprise
has been the very different synset relation structures in the two
networks: horizontal, dumbbell-model based in PWN and vertical,
hyponymy-based in plWN. Moreover, the two wordnets display
differences in the grouping of adjectives into semantic domains
and in the size of the adjective category. To handle the above
contrasts, a series of automatic prompt algorithms and a manual
mapping procedure relying on corresponding synset and lexical
unit relations as well as on inter-lingual relations between noun
synsets were proposed in the pilot stage of mapping (Rudnicka et
al. 2015). In the paper we discuss the final results of the mapping
process as well as explain example mapping choices. Suggestions
for further development of mapping are also given.
Relation- and Phrase-level Linking of FrameNet with Sar-graphs
Aleksandra Gabryszak, Sebastian Krause, Leonhard Hennig, Feiyu Xu and Hans Uszkoreit
Recent research shows the importance of linking linguistic
knowledge resources for the creation of large-scale linguistic
data. We describe our approach for combining two English
resources, FrameNet and sar-graphs, and illustrate the benefits of
the linked data in a relation extraction setting. While FrameNet
consists of schematic representations of situations, linked to
lexemes and their valency patterns, sar-graphs are knowledge
resources that connect semantic relations from factual knowledge
graphs to the linguistic phrases used to express instances of these
relations. We analyze the conceptual similarities and differences
of both resources and propose to link sar-graphs and FrameNet
on the levels of relations/frames as well as phrases. The former
alignment involves a manual ontology mapping step, which allows
us to extend sar-graphs with new phrase patterns from FrameNet.
The phrase-level linking, on the other hand, is fully automatic. We
investigate the quality of the automatically constructed links and
identify two main classes of errors.
Mapping Ontologies Using Ontologies: Cross-lingual Semantic Role Information Transfer
Balázs Indig, Márton Miháltz and András Simonyi
This paper presents the process of enriching the verb frame
database of a Hungarian natural language parser to enable the
assignment of semantic roles. We accomplished this by linking
the parser’s verb frame database to existing linguistic resources
such as VerbNet and WordNet, and automatically transferring
back semantic knowledge. We developed OWL ontologies that
map the various constraint description formalisms of the linked
resources and employed a logical reasoning device to facilitate the
linking procedure. We present results and discuss the challenges
and pitfalls that arose from this undertaking.
Generating a Large-Scale Entity Linking Dictionary from Wikipedia Link Structure and Article Text
Ravindra Harige and Paul Buitelaar
Wikipedia has been increasingly used as a knowledge base for
open-domain Named Entity Linking and Disambiguation. In this
task, a dictionary with entity surface forms plays an important
role in finding a set of candidate entities for the mentions in text.
Existing dictionaries mostly rely on the Wikipedia link structure,
like anchor texts, redirect links and disambiguation links. In this
paper, we introduce a dictionary for Entity Linking that includes
name variations extracted from Wikipedia article text, in addition
to name variations derived from the Wikipedia link structure. With
this approach, we show an increase in the coverage of entities and
their mentions in the dictionary in comparison to other Wikipedia
based dictionaries.
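The role of such a dictionary in candidate generation can be sketched as follows (the surface forms and entity titles below are invented examples; the real resource is mined from Wikipedia anchors, redirects, disambiguation pages and article text):

```python
# Toy sketch of a surface-form dictionary for entity linking.
# Keys are mentions as found in text; values are candidate Wikipedia titles.
surface_forms = {
    # variants derivable from link structure (anchor texts, redirects)
    "NYC": ["New York City"],
    "Big Apple": ["New York City"],
    # a variant mined from article text rather than from links
    "the city of New York": ["New York City"],
    # an ambiguous mention yields several candidates
    "New York": ["New York City", "New York (state)"],
}

def candidates(mention: str):
    """Return candidate entities for a mention (empty list if unknown)."""
    return surface_forms.get(mention, [])

print(candidates("Big Apple"))
print(candidates("New York"))
```

Adding text-mined variants like "the city of New York" is precisely how the proposed dictionary widens coverage over link-structure-only dictionaries.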
The Open Linguistics Working Group: Developing the Linguistic Linked Open Data Cloud
John Philip McCrae, Christian Chiarcos, Francis Bond, Philipp Cimiano, Thierry Declerck, Gerard de Melo, Jorge Gracia, Sebastian Hellmann, Bettina Klimek, Steven Moran, Petya Osenova, Antonio Pareja-Lora and Jonathan Pool
The Open Linguistics Working Group (OWLG) brings together
researchers from various fields of linguistics, natural language
processing, and information technology to present and discuss
principles, case studies, and best practices for representing,
publishing and linking linguistic data collections. A major
outcome of our work is the Linguistic Linked Open Data (LLOD)
cloud, an LOD (sub-)cloud of linguistic resources, which covers
various linguistic databases, lexicons, corpora, terminologies, and
metadata repositories. We present and summarize five years of
progress on the development of the cloud and of advancements
in open data in linguistics, and we describe recent community
activities. The paper aims to serve as a guideline to orient and
involve researchers with the community and/or Linguistic Linked
Open Data.
Cross-lingual RDF Thesauri Interlinking
Tatiana Lesnikova, Jérôme David and Jérôme Euzenat
Various lexical resources are being published in RDF. To enhance
the usability of these resources, identical resources in different
data sets should be linked. If lexical resources are described
in different natural languages, then techniques to deal with
multilinguality are required for interlinking. In this paper,
we evaluate machine translation for interlinking concepts, i.e.,
generic entities named with a common noun or term. In our
previous work, the evaluated method has been applied on named
entities. We conduct two experiments involving different thesauri
in different languages. The first experiment involves concepts
from the TheSoz multilingual thesaurus in three languages:
English, French and German. The second experiment involves
concepts from the EuroVoc and AGROVOC thesauri in English
and Chinese respectively. Our results demonstrate that machine
translation can be beneficial for cross-lingual thesauri interlinking
independently of a dataset structure.
P31 - LR Infrastructures and Architectures (1)
Thursday, May 26, 14:55
Chairperson: Yohei Murakami Poster Session
The Language Resource Life Cycle: Towards a Generic Model for Creating, Maintaining, Using and Distributing Language Resources
Georg Rehm
Language Resources (LRs) are an essential ingredient of current
approaches in Linguistics, Computational Linguistics, Language
Technology and related fields. LRs are collections of spoken or
written language data, typically annotated with linguistic analysis
information. Different types of LRs exist, for example, corpora,
ontologies, lexicons, collections of spoken language data (audio),
or collections that also include video (multimedia, multimodal).
Often, LRs are distributed with specific tools, documentation,
manuals or research publications. The different phases that
involve creating and distributing an LR can be conceptualised as
a life cycle. While the idea of handling the LR production and
maintenance process in terms of a life cycle was brought up
quite some time ago, a best-practice model or common approach
can still be considered a research gap. This article aims to help
fill this gap by proposing an initial version of a generic Language
Resource Life Cycle that can be used to inform, direct, control
and evaluate LR research and development activities (including
description, management, production, validation and evaluation
workflows).
A Large-scale Recipe and Meal Data Collection as Infrastructure for Food Research
Jun Harashima, Michiaki Ariga, Kenta Murata and Masayuki Ioki
Everyday meals are an important part of our daily lives and,
currently, there are many Internet sites that help us plan these
meals. Allied to the growth in the amount of food data such
as recipes available on the Internet is an increase in the number
of studies on these data, such as recipe analysis and recipe
search. However, there are few publicly available resources for
food research; those that do exist do not include a wide range
of food data or any meal data (that is, likely combinations of
recipes). In this study, we construct a large-scale recipe and
meal data collection as the underlying infrastructure to promote
food research. Our corpus consists of approximately 1.7 million
recipes and 36,000 meals from Cookpad, one of the largest recipe
sites in the world. We made the corpus available to researchers
in February 2015 and as of February 2016, 82 research groups at
56 universities have made use of it to enhance their studies.
EstNLTK - NLP Toolkit for Estonian
Siim Orasmaa, Timo Petmanson, Alexander Tkachenko, Sven Laur and Heiki-Jaan Kaalep
Although there are many tools for natural language processing
tasks in Estonian, these tools are very loosely interoperable,
and it is not easy to build practical applications on top of
them. In this paper, we introduce a new Python library for
natural language processing in Estonian, which provides a unified
programming interface for various NLP components. The
EstNLTK toolkit provides utilities for basic NLP tasks including
tokenization, morphological analysis, lemmatisation and named
entity recognition as well as offers more advanced features such
as clause segmentation, temporal expression extraction and
normalization, verb chain detection, Estonian Wordnet integration
and rule-based information extraction. Accompanied by a detailed
API documentation and comprehensive tutorials, EstNLTK is
suitable for a wide range of audiences. We believe EstNLTK is
mature enough to be used for developing NLP-backed systems
both in industry and research. EstNLTK is freely available under
the GNU GPL version 2+ license, which is standard for academic
software.
South African National Centre for Digital Language Resources
Justus Roux
This presentation introduces the imminent establishment of a new
language resource infrastructure focusing on languages spoken in
Southern Africa, with an eventual aim to become a hub for digital
language resources within Sub-Saharan Africa. The Constitution
of South Africa makes provision for 11 official languages all with
equal status. The current language Resource Management Agency
will be merged with the new Centre, which will have a wider
focus than that of data acquisition, management and distribution.
The Centre will host two main programs: Digitisation and
Digital Humanities. The digitisation program will focus on the
systematic digitisation of relevant text, speech and multi-modal
data across the official languages. Relevancy will be determined
by a Scientific Advisory Board. This will take place on a
continuous basis through specified projects allocated to national
members of the Centre, as well as through open-calls aimed at
the academic as well as local communities. The digital resources
will be managed and distributed through a dedicated web-based
portal. The development of the Digital Humanities program
will entail extensive academic support for projects implementing
digital language based data. The Centre will function as an
enabling research infrastructure primarily supported by national
government and hosted by the North-West University.
Design and Development of the MERLIN Learner Corpus Platform
Verena Lyding and Karin Schöne
In this paper, we report on the design and development of an
online search platform for the MERLIN corpus of learner texts
in Czech, German and Italian. It was created in the context of the
MERLIN project, which aims at empirically illustrating features
of the Common European Framework of Reference (CEFR) for
evaluating language competences based on authentic learner text
productions compiled into a learner corpus. Furthermore, the
project aims at providing access to the corpus through a search
interface adapted to the needs of multifaceted target groups
involved with language learning and teaching. This article starts
by providing a brief overview on the project ambition, the data
resource and its intended target groups. Subsequently, the main
focus of the article is on the design and development process of
the platform, which is carried out in a user-centred fashion. The
paper presents the user studies carried out to collect requirements,
details the resulting decisions concerning the platform design and
its implementation, and reports on the evaluation of the platform
prototype and final adjustments.
FLAT: Constructing a CLARIN-Compatible Home for Language Resources
Menzo Windhouwer, Marc Kemps-Snijders, Paul Trilsbeek, André Moreira, Bas Van der Veen, Guilherme Silva and Daniel Von Reihn
Language resources are valuable assets, both for institutions
and researchers. To safeguard these resources requirements for
repository systems and data management have been specified by
various branch organizations, e.g., CLARIN and the Data Seal
of Approval. This paper describes these and some additional
ones posed by the authors’ home institutions, and shows
how they are met by FLAT, which provides a new home for language
resources. The basis of FLAT is formed by the Fedora Commons
repository system. This repository system can meet many of
the requirements out of the box, but additional configuration
and some development work are still needed to meet the remaining
ones, e.g., to add support for Handles and Component Metadata.
This paper describes design decisions taken in the construction of
FLAT’s system architecture via a mix-and-match strategy, with a
preference for the reuse of existing solutions. FLAT is developed
and used by the Meertens Institute and The Language Archive, but
is also freely available for anyone in need of a CLARIN-compliant
repository for their language resources.
CLARIAH in the Netherlands
Jan Odijk
I introduce CLARIAH in the Netherlands, which aims to
contribute the Netherlands part of a Europe-wide humanities
research infrastructure. I describe the digital turn in the
humanities, the background and context of CLARIAH, both
nationally and internationally, its relation to the CLARIN and
DARIAH infrastructures, and the rationale for joining forces
between CLARIN and DARIAH in the Netherlands. I also
describe the first results of joining forces as achieved in the
CLARIAH-SEED project, and the plans of the CLARIAH-CORE
project, which is currently running.
Crosswalking from CMDI to Dublin Core and MARC 21
Claus Zinn, Thorsten Trippel, Steve Kaminski and Emanuel Dima
The Component MetaData Infrastructure (CMDI) is a framework
for the creation and usage of metadata formats to describe all
kinds of resources in the CLARIN world. To better connect to
the library world, and to allow librarians to enter metadata for
linguistic resources into their catalogues, a crosswalk from CMDI-
based formats to bibliographic standards is required. The general
and rather fluid nature of CMDI, however, makes it hard to map
arbitrary CMDI schemas to metadata standards such as Dublin
Core (DC) or MARC 21, which have a mature, well-defined and
fixed set of field descriptors. In this paper, we address the issue
and propose crosswalks between CMDI-based profiles originating
from the NaLiDa project and DC and MARC 21, respectively.
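In spirit, such a crosswalk is a field-by-field mapping table. A hypothetical fragment (the CMDI element names below are invented for illustration, not the actual NaLiDa profiles) could be expressed as:

```python
# Hypothetical fragment of a CMDI-profile -> Dublin Core crosswalk.
# Element names are invented; real CMDI profiles vary from project to
# project, which is exactly what makes a general mapping hard.
cmdi_to_dc = {
    "ResourceName": "dc:title",
    "Creator": "dc:creator",
    "PublicationYear": "dc:date",
    "ResourceType": "dc:type",
}

def crosswalk(cmdi_record: dict) -> dict:
    """Map fields covered by the crosswalk; unmapped CMDI fields are dropped."""
    return {cmdi_to_dc[k]: v for k, v in cmdi_record.items() if k in cmdi_to_dc}

record = {"ResourceName": "Example Corpus", "Creator": "Doe, J.", "Licence": "CC BY"}
print(crosswalk(record))  # the Licence field has no DC target here and is lost
```

The dropped `Licence` field illustrates the core difficulty the paper addresses: a fluid schema rarely maps losslessly onto a fixed set of field descriptors.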
Data Management Plans and Data Centers
Denise DiPersio, Christopher Cieri and Daniel Jaquette
Data management plans, data sharing plans and the like are now
required by funders worldwide as part of research proposals.
Concerned with promoting the notion of open scientific data,
funders view such plans as the framework for satisfying the
generally accepted requirements for data generated in funded
research projects, among them that it be accessible, usable,
standardized to the degree possible, secure and stable. This
paper examines the origins of data management plans, their
requirements and issues they raise for data centers and HLT
resource development in general.
UIMA-Based JCoRe 2.0 Goes GitHub and Maven Central — State-of-the-Art Software Resource Engineering and Distribution of NLP Pipelines
Udo Hahn, Franz Matthies, Erik Faessler and Johannes Hellrich
We introduce JCoRe 2.0, the relaunch of a UIMA-based open
software repository for full-scale natural language processing
originating from the Jena University Language & Information
Engineering (JULIE) Lab. In an attempt to put the new release
of JCoRe on firm software engineering ground, we uploaded it to
GitHub, a social coding platform, with an underlying source code
versioning system and various means to support collaboration
for software development and code modification management.
In order to automate the builds of complex NLP pipelines and
properly represent and track dependencies of the underlying Java
code, we incorporated Maven as part of our software configuration
management efforts. In the meantime, we have deployed our
artifacts on Maven Central, as well. JCoRe 2.0 offers a broad
range of text analytics functionality (mostly) for English-language
scientific abstracts and full-text articles, especially from the life
sciences domain.
Facilitating Metadata Interoperability in CLARIN-DK
Lene Offersgaard and Dorte Haltrup Hansen
The issue for CLARIN archives at the metadata level is to facilitate
the user’s possibility to describe their data, even with their own
standard, and at the same time make these metadata meaningful
for a variety of users with a variety of resource types, and ensure
that the metadata are useful for search across all resources both
at the national and at the European level. We see that different
people from different research communities fill in the metadata
in different ways, even though the metadata were defined and
documented. This has an impact when the metadata are harvested
and displayed in different environments: information is lost.
In this paper we view the challenges of ensuring metadata
interoperability through examples of propagation of metadata
values from the CLARIN-DK archive to the VLO. We see that
the CLARIN Community in many ways supports interoperability,
but argue that agreeing upon standards and clearly defining
the semantics of the metadata and their content is indispensable for
interoperability to work successfully. The key points are clear and
freely available definitions, accessible documentation and easily
usable facilities and guidelines for the metadata creators.
The Language Application Grid and Galaxy
Nancy Ide, Keith Suderman, James Pustejovsky, Marc Verhagen and Christopher Cieri
The NSF-SI2-funded LAPPS Grid project is a collaborative
effort among Brandeis University, Vassar College, Carnegie-
Mellon University (CMU), and the Linguistic Data Consortium
(LDC), which has developed an open, web-based infrastructure
through which resources can be easily accessed and within
which tailored language services can be efficiently composed,
evaluated, disseminated and consumed by researchers, developers,
and students across a wide variety of disciplines. The LAPPS
Grid project recently adopted Galaxy (Giardine et al., 2005),
a robust, well-developed, and well-supported front end for
workflow configuration, management, and persistence. Galaxy
allows data inputs and processing steps to be selected from
graphical menus, and results are displayed in intuitive plots
and summaries that encourage interactive workflows and the
exploration of hypotheses. The Galaxy workflow engine provides
significant advantages for deploying pipelines of LAPPS Grid web
services, including not only the means to create and deploy
locally run and even customized versions of the LAPPS Grid or to
run it in the cloud, but also access to a huge
array of statistical and visualization tools that have been developed
for use in genomics research.
P32 - Large Projects and Infrastructures (1)
Thursday, May 26, 14:55
Chairperson: Zygmunt Vetulani Poster Session
The IPR-cleared Corpus of Contemporary Written and Spoken Romanian Language
Dan Tufis, Verginica Barbu Mititelu, Elena Irimia, Stefan Daniel Dumitrescu and Tiberiu Boros
The article describes the current status of a large national
project, CoRoLa, aiming at building a reference corpus for
the contemporary Romanian language. Unlike many other
national corpora, CoRoLa contains only IPR-cleared texts
and speech data, obtained from some of the country’s most
representative publishing houses, broadcasting agencies, editorial
offices, newspapers and popular bloggers. For the written
component 500 million tokens are targeted and for the oral one
300 hours of recordings. The choice of texts is done according
to their functional style, domain and subdomain, also with an
eye to international practice. A metadata file (following
the CMDI model) is associated with each text file. Collected
texts are cleaned and transformed into a format compatible with
the tools for automatic processing (segmentation, tokenization,
lemmatization, part-of-speech tagging). The paper also presents
up-to-date statistics about the structure of the corpus almost two
years before its official launching. The corpus will be freely
available for searching. Users will be able to download the
results of their searches, and the original files where this does not
conflict with stipulations in the protocols we have with text providers.
SYN2015: Representative Corpus of Contemporary Written Czech
Michal Kren, Václav Cvrcek, Tomáš Capka, Anna Cermáková, Milena Hnátková, Lucie Chlumská, Tomáš Jelínek, Dominika Kováríková, Vladimír Petkevic, Pavel Procházka, Hana Skoumalová, Michal Škrabal, Petr Trunecek, Pavel Vondricka and Adrian Jan Zasina
The paper concentrates on the design, composition and annotation
of SYN2015, a new 100-million representative corpus of
contemporary written Czech. SYN2015 is a sequel to the
representative corpora of the SYN series that can be described
as traditional (as opposed to the web-crawled corpora), featuring
cleared copyright issues, well-defined composition, reliability
of annotation and high-quality text processing. At the same
time, SYN2015 is designed as a reflection of the variety of
written Czech text production with necessary methodological and
technological enhancements that include a detailed bibliographic
annotation and text classification based on an updated scheme.
The corpus has been produced using a completely rebuilt text
processing toolchain called SynKorp. SYN2015 is lemmatized,
morphologically and syntactically annotated with state-of-the-art
tools. It has been published within the framework of the Czech
National Corpus and it is available via the standard corpus query
interface KonText at http://kontext.korpus.cz as well
as a dataset in shuffled format.
LREC as a Graph: People and Resources in a Network
Riccardo Del Gratta, Francesca Frontini, Monica Monachini, Gabriella Pardelli, Irene Russo, Roberto Bartolini, Fahad Khan, Claudia Soria and Nicoletta Calzolari
This proposal describes a new way to visualise resources in
the LREMap, a community-built repository of language resource
descriptions and uses. The LREMap is represented as a force-
directed graph, where resources, papers and authors are nodes.
The analysis of the visual representation of the underlying graph
is used to study how the community gathers around LRs and how
LRs are used in research.
The Public License Selector: Making Open Licensing Easier
Pawel Kamocki, Pavel Stranák and Michal Sedlák
Researchers in Natural Language Processing rely on the availability
of data and software, ideally under open licenses, but little is
done to actively encourage it. In fact, the current Copyright
framework grants exclusive rights to authors to copy their works,
make them available to the public and make derivative works (such
as annotated language corpora). Moreover, in the EU databases
are protected against unauthorized extraction and re-utilization of
their contents. Therefore, proper public licensing plays a crucial
role in providing access to research data. A public license is a
license that grants certain rights not to one particular user, but to
the general public (everybody). Our article presents a tool that we
developed and whose purpose is to assist the user in the licensing
process. As software and data should be licensed under different
licenses, the tool is composed of two separate parts: Data and
Software. The underlying logic as well as elements of the graphic
interface are presented below.
NLP Infrastructure for the Lithuanian Language
Daiva Vitkutė-Adžgauskienė, Andrius Utka, Darius Amilevicius and Tomas Krilavicius
The Information System for Syntactic and Semantic Analysis
of the Lithuanian language (Lith. Lietuvių kalbos sintaksinės ir
semantinės analizės informacinė sistema, LKSSAIS) is the first
infrastructure for the Lithuanian language combining Lithuanian
language tools and resources for diverse linguistic research and
application tasks. It provides access to basic as well
as advanced natural language processing tools and resources,
including tools for corpus creation and management, text
preprocessing and annotation, ontology building, named entity
recognition, morphosyntactic and semantic analysis, sentiment
analysis, etc. It is an important platform for researchers and
developers in the field of natural language technology.
CodE Alltag: A German-Language E-Mail Corpus
Ulrike Krieg-Holz, Christian Schuschnig, Franz Matthies, Benjamin Redling and Udo Hahn
We introduce CODE ALLTAG, a text corpus composed of
German-language e-mails. It is divided into two partitions: the
first of these portions, CODE ALLTAG_XL, consists of a bulk-size
collection drawn from an openly accessible e-mail archive
(roughly 1.5M e-mails), whereas the second portion, CODE
ALLTAG_S+d, is much smaller in size (less than a thousand
e-mails), yet stands out by providing demographic data from each author of an
e-mail. CODE ALLTAG thus currently constitutes the largest
E-Mail corpus ever built. In this paper, we describe, for both parts,
the solicitation process for gathering e-mails, present descriptive
statistical properties of the corpus, and, for CODE ALLTAG_S+d,
reveal a compilation of demographic features of the donors of e-
mails.
P33 - Morphology (2)
Thursday, May 26, 14:55
Chairperson: Felice dell’Orletta
Poster Session
Rapid Development of Morphological Analyzers for Typologically Diverse Languages
Seth Kulick and Ann Bies
The Low Resource Language research conducted under DARPA’s
Broad Operational Language Translation (BOLT) program
required the rapid creation of text corpora of typologically
diverse languages (Turkish, Hausa, and Uzbek) which were
annotated with morphological information, along with other types
of annotation. Since the output of morphological analyzers is
a significant aid to morphological annotation, we developed a
morphological analyzer for each language in order to support
the annotation task, and also as a deliverable by itself. Our
framework for analyzer creation results in tables similar to those
used in the successful SAMA analyzer for Arabic, but with a more
abstract linguistic level, from which the tables are derived. A
lexicon was developed from available resources for integration
with the analyzer, and given the speed of development and
uncertain coverage of the lexicon, we assumed that the analyzer
would necessarily be lacking in some coverage for the project
annotation. Our analyzer framework was therefore focused on
rapid implementation of the key structures of the language,
together with accepting “wildcard” solutions as possible analyses
for a word with an unknown stem, building upon our similar
experiences with morphological annotation with Modern Standard
Arabic and Egyptian Arabic.
A Neural Lemmatizer for Bengali
Abhisek Chakrabarty, Akshay Chaturvedi and Utpal Garain
We propose a novel neural lemmatization model which is
language independent and supervised in nature. To handle the
words in a neural framework, a word embedding technique is
used to represent words as vectors. The proposed lemmatizer
makes use of contextual information of the surface word to be
lemmatized. Given a word along with its contextual neighbours
as input, the model is designed to produce the lemma of
the concerned word as output. We introduce a new network
architecture that permits only dimension specific connections
between the input and the output layer of the model. For the
present work, Bengali is taken as the reference language. Two
datasets are prepared for training and testing purpose consisting
of 19,159 and 2,126 instances respectively. As Bengali is a
resource-scarce language, these datasets will be beneficial for
the respective research community. Evaluation shows
that the neural lemmatizer achieves 69.57% accuracy on the
test dataset and outperforms a simple cosine-similarity-based
baseline strategy by a margin of 1.37%.
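The "dimension specific connections" suggest a layer whose weight matrix is diagonal: output unit i is wired only to input unit i. A minimal sketch of such a layer follows; the sizes, values and nonlinearity are illustrative assumptions, not the authors' actual architecture:

```python
# Sketch of a "dimension-specific" (diagonal) layer: the dense dim x dim
# weight matrix is replaced by one weight per dimension, so each output
# unit depends only on the corresponding input unit.
import numpy as np

def dimension_specific_layer(x, w, b):
    """Elementwise layer: y_i = tanh(w_i * x_i + b_i)."""
    return np.tanh(w * x + b)

dim = 8                            # embedding dimensionality (illustrative)
x = np.linspace(-1.0, 1.0, dim)    # embedding of the surface word
w = np.full(dim, 0.5)              # one weight per dimension, not a matrix
b = np.zeros(dim)
y = dimension_specific_layer(x, w, b)
assert y.shape == x.shape          # output stays in the same vector space
```

Compared with a fully connected layer, this cuts the parameter count from dim² to dim and keeps each output dimension interpretable in terms of its input dimension.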
A Finite-state Morphological Analyser for Tuvan
Francis Tyers, Aziyana Bayyr-ool, Aelita Salchak and Jonathan Washington
This paper describes the development of free/open-source
finite-state morphological transducers for Tuvan, a Turkic
language spoken in and around the Tuvan Republic in Russia.
The finite-state toolkit used for the work is the Helsinki
Finite-State Toolkit (HFST); we use the lexc formalism for
modelling the morphotactics and the twol formalism for modelling
morphophonological alternations. We present a novel description
of the morphological combinatorics of pseudo-derivational
morphemes in Tuvan. An evaluation is presented which shows
that the transducer has reasonable coverage (around 93%) on
freely available corpora of the language, and high precision (over
99%) on a manually verified test set.
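Corpus coverage, as reported here, is typically computed as the share of tokens for which the transducer returns at least one analysis. A toy sketch, in which the mini-lexicon and the lookup function stand in for the real HFST transducer:

```python
# Naive coverage metric for a morphological analyser: the fraction of
# corpus tokens that receive at least one analysis. `lookup` is a
# stand-in for the real transducer's lookup; the lexicon is invented.

def coverage(tokens, lookup):
    """Share of tokens with a non-empty analysis set."""
    return sum(1 for t in tokens if lookup(t)) / len(tokens)

toy_lexicon = {
    "alpha": ["alpha<n>"],
    "beta": ["beta<n>"],
    "gamma": ["gamma<v>"],
}
tokens = ["alpha", "beta", "gamma", "unknownword"]
print(coverage(tokens, lambda t: toy_lexicon.get(t, [])))  # 0.75
```

Precision, in contrast, is measured against a manually verified gold standard, which is why the two figures are reported separately.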
Tezaurs.lv: the Largest Open Lexical Database for Latvian
Andrejs Spektors, Ilze Auziņa, Roberts Darģis, Normunds Grūzītis, Pēteris Paikens, Lauma Pretkalniņa, Laura Rituma and Baiba Saulīte
We describe an extensive and versatile lexical resource for
Latvian, an under-resourced Indo-European language, which we
call Tezaurs (Latvian for ‘thesaurus’). It comprises a large
explanatory dictionary of more than 250,000 entries that are
derived from more than 280 external sources. The dictionary
is enriched with phonetic, morphological, semantic and other
annotations, as well as augmented by various language processing
tools allowing for the generation of inflectional forms and
pronunciation, for on-the-fly selection of corpus examples, for
suggesting synonyms, etc. Tezaurs is available as a public and
widely used web application for end-users, as an open data set
for use in language technology (LT), and as an API, i.e. a set of
web services for integration into third-party applications. The
ultimate goal of Tezaurs is to be the central computational lexicon
for Latvian, bringing together all Latvian words and frequently
used multi-word units and allowing for the integration of other LT
resources and tools.
A Finite-State Morphological Analyser for Sindhi
Raveesh Motlani, Francis Tyers and Dipti Sharma
Morphological analysis is a fundamental task in natural-language
processing, which is used in other NLP applications such as
part-of-speech tagging, syntactic parsing, information retrieval,
machine translation, etc. In this paper, we present our work on
the development of a free/open-source finite-state morphological
analyser for Sindhi. We have used Apertium’s lttoolbox as our
finite-state toolkit to implement the transducer. The system is
developed using a paradigm-based approach, wherein a paradigm
defines all the word forms and their morphological features for a
given stem (lemma). We have evaluated our system on the Sindhi
Wikipedia corpus and achieved a reasonable coverage of 81% and
a precision of over 97%.
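The paradigm-based approach can be pictured as attaching a shared list of (suffix, tag) endings to each stem of an inflectional class. A minimal sketch, with invented entries rather than real Sindhi data or the actual lttoolbox format:

```python
# Sketch of paradigm expansion in the style of Apertium's lttoolbox:
# a paradigm is a list of (surface suffix, analysis suffix) pairs shared
# by every stem of an inflectional class. The data is purely illustrative.

def expand(stem, paradigm):
    """Generate (surface form, analysis) pairs for one stem."""
    return [(stem + suffix, stem + tags) for suffix, tags in paradigm]

toy_noun_paradigm = [("", "<n><sg>"), ("s", "<n><pl>")]
print(expand("book", toy_noun_paradigm))
# [('book', 'book<n><sg>'), ('books', 'book<n><pl>')]
```

Defining word forms once per paradigm rather than once per word is what makes this approach fast to develop: adding a new stem to the lexicon immediately yields all of its forms.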
Deriving Morphological Analyzers from Example Inflections
Markus Forsberg and Mans Hulden
This paper presents a semi-automatic method to derive
morphological analyzers from a limited number of example
inflections suitable for languages with alphabetic writing systems.
The system we present learns the inflectional behavior of
morphological paradigms from examples and converts the learned
paradigms into a finite-state transducer that is able to map inflected
forms of previously unseen words into lemmas and corresponding
morphosyntactic descriptions. We evaluate the system when
provided with inflection tables for several languages collected
from the Wiktionary.
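The generalisation step at the heart of such systems can be illustrated by factoring an inflection table into a shared stem and a list of endings, then reusing the endings on unseen stems. The sketch below handles only a single common prefix; real paradigm-learning methods also cope with infixes and multiple variable slots:

```python
# Toy version of paradigm learning from an example inflection table:
# factor out the longest common prefix as the variable part and keep the
# varying endings as the paradigm.
import os.path

def learn_paradigm(forms):
    """Split an inflection table into a shared stem and a list of endings."""
    stem = os.path.commonprefix(forms)
    return stem, [form[len(stem):] for form in forms]

def apply_paradigm(stem, endings):
    """Reuse the learned endings to inflect a previously unseen stem."""
    return [stem + e for e in endings]

# Swedish 'kasta' (throw): infinitive, past, supine.
stem, endings = learn_paradigm(["kasta", "kastade", "kastat"])
print(apply_paradigm("hoppa", endings))  # ['hoppa', 'hoppade', 'hoppat']
```

Once learned, such paradigms can be compiled into a finite-state transducer so that inflected forms of unseen words map back to lemmas and morphosyntactic descriptions.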
Morphological Analysis of Sahidic Coptic for Automatic Glossing
Daniel Smith and Mans Hulden
We report on the implementation of a morphological analyzer for
the Sahidic dialect of Coptic, a now extinct Afro-Asiatic language.
The system is developed in the finite-state paradigm. The main
purpose of the project is to provide a method by which scholars
and linguists can semi-automatically gloss extant texts written in
Sahidic. Since a complete lexicon containing all attested forms
in different manuscripts requires significant expertise in Coptic
spanning almost 1,000 years, we have equipped the analyzer with
a core lexicon and extended it with a “guesser” ability to capture
out-of-vocabulary items in any inflection. We also suggest an
ASCII transliteration for the language. A brief evaluation is
provided.
The on-line version of Grammatical Dictionary of Polish
Marcin Woliński and Witold Kieraś
We present the new online edition of a dictionary of Polish
inflection – the Grammatical Dictionary of Polish (http://
sgjp.pl). The dictionary is interesting for several reasons: it
is comprehensive (over 330,000 lexemes corresponding to almost
4,300,000 different textual words; 1116 handcrafted inflectional
patterns), the inflection is presented in an explicit manner in the
form of carefully designed tables, the user interface facilitates
advanced queries by several features (lemmas, forms, applicable
grammatical categories, types of inflection). Moreover, the data
of the dictionary is used in morphological analysers, including our
product Morfeusz (http://sgjp.pl/morfeusz). From the
start, the dictionary was meant to be comfortable for the human
reader as well as to be ready for use in NLP applications. In the
paper we briefly discuss both aspects of the resource.
P34 - Semantic Lexicons
Thursday, May 26, 14:55
Chairperson: Kiril Simov
Poster Session
Automatically Generated Affective Norms of Abstractness, Arousal, Imageability and Valence for 350 000 German Lemmas
Maximilian Köper and Sabine Schulte im Walde
This paper presents a collection of 350,000 German lemmatised
words, rated on four psycholinguistic affective attributes. All
ratings were obtained via a supervised learning algorithm that
can automatically calculate a numerical rating of a word. We
applied this algorithm to abstractness, arousal, imageability
and valence. Comparison with human ratings reveals high
correlation across all rating types. The full resource is publicly
available at: http://www.ims.uni-stuttgart.de/
data/affective_norms/.
Latin Vallex. A Treebank-based Semantic Valency Lexicon for Latin
Marco Passarotti, Berta González Saavedra and Christophe Onambele
Despite a centuries-long tradition in lexicography, Latin lacks
state-of-the-art computational lexical resources. This situation is
strictly related to the still quite limited amount of linguistically
annotated textual data for Latin, which can help the building of
new lexical resources by supporting them with empirical evidence.
However, projects for creating new language resources for Latin
have been launched over the last decade to fill this gap. In this
paper, we present Latin Vallex, a valency lexicon for Latin built in
mutual connection with the semantic and pragmatic annotation of
two Latin treebanks featuring texts of different eras. On the one
hand, such a connection between the empirical evidence provided
by the treebanks and the lexicon makes it possible to enhance each
frame entry in the lexicon with its frequency in real data. On the other
hand, each valency-capable word in the treebanks is linked to a
frame entry in the lexicon.
A Framework for Cross-lingual/Node-wise Alignment of Lexical-Semantic Resources
Yoshihiko Hayashi
Given lexical-semantic resources in different languages, it is
useful to establish cross-lingual correspondences, preferably with
semantic relation labels, between the concept nodes in these
resources. This paper presents a framework for enabling a cross-
lingual/node-wise alignment of lexical-semantic resources, where
cross-lingual correspondence candidates are first discovered and
ranked, and then classified by a succeeding module. Indeed,
we propose that a two-tier classifier configuration is feasible
for the second module: the first classifier filters out possibly
irrelevant correspondence candidates and the second classifier
assigns a relatively fine-grained semantic relation label to each
of the surviving candidates. The results of Japanese-to-English
alignment experiments using the EDR Electronic Dictionary and
Princeton WordNet are described to demonstrate the validity of the
proposal.
Lexical Coverage Evaluation of Large-scale Multilingual Semantic Lexicons for Twelve Languages
Scott Piao, Paul Rayson, Dawn Archer, Francesca Bianchi, Carmen Dayrell, Mahmoud El-Haj, Ricardo-María Jiménez, Dawn Knight, Michal Křen, Laura Löfberg, Rao Muhammad Adeel Nawab, Jawad Shafi, Phoey Lee Teh and Olga Mudraya
The last two decades have seen the development of various
semantic lexical resources such as WordNet (Miller, 1995) and the
USAS semantic lexicon (Rayson et al., 2004), which have played
an important role in the areas of natural language processing
and corpus-based studies. Recently, increasing efforts have
been devoted to extending the semantic frameworks of existing
lexical knowledge resources to cover more languages, such as
EuroWordNet and Global WordNet. In this paper, we report on
the construction of large-scale multilingual semantic lexicons for
twelve languages, which employ the unified Lancaster semantic
taxonomy and provide a multilingual lexical knowledge base
for the automatic UCREL semantic annotation system (USAS).
Our work contributes towards the goal of constructing larger-
scale and higher-quality multilingual semantic lexical resources
and developing corpus annotation tools based on them. Lexical
coverage is an important factor concerning the quality of the
lexicons and the performance of the corpus annotation tools, and
in this experiment we focus on evaluating the lexical coverage
achieved by the multilingual lexicons and semantic annotation
tools based on them. Our evaluation shows that some semantic
lexicons such as those for Finnish and Italian have achieved lexical
coverage of over 90% while others need further expansion.
Building Concept Graphs from Monolingual Dictionary Entries
Gábor Recski
We present the dict_to_4lang tool for processing entries
of three monolingual dictionaries of English and mapping
definitions to concept graphs following the 4lang principles of
semantic representation introduced by Kornai (2010). 4lang
representations are domain- and language-independent, and make
use of only a very limited set of primitives to encode the meaning
of all utterances. Our pipeline relies on the Stanford Dependency
Parser for syntactic analysis; the dep_to_4lang module then
builds directed graphs of concepts based on dependency relations
between words in each definition. Several issues are handled by
construction-specific rules that are applied to the output of
dep_to_4lang. Manual evaluation suggests that ca. 75% of graphs built
from the Longman Dictionary are either entirely correct or contain
only minor errors. dict_to_4lang is available under an MIT
license as part of the 4lang library and has been used successfully
in measuring Semantic Textual Similarity (Recski and Ács, 2015).
An interactive demo of core 4lang functionalities is available at
http://4lang.hlt.bme.hu.
Semantic Layer of the Valence Dictionary of Polish Walenty
Elżbieta Hajnicz, Anna Andrzejczuk and Tomasz Bartosiak
This article presents the semantic layer of Walenty—a new
valence dictionary of Polish predicates, with a number of novel
features, as compared to other such dictionaries. The dictionary
contains two layers, syntactic and semantic. The syntactic layer
describes syntactic and morphosyntactic constraints predicates put
on their dependants. In particular, it includes a comprehensive
and powerful phraseological component. The semantic layer
shows how predicates and their arguments are involved in
a described situation in an utterance. These two layers
91
are connected, representing how semantic arguments can be
realised on the surface. Each syntactic schema and each
semantic frame are illustrated by at least one exemplary sentence
attested in linguistic reality. The semantic layer consists of
semantic frames represented as lists of pairs <semantic role,
selectional preference> and connected with PlWordNet lexical
units. Semantic roles have a two-level representation (basic
roles are provided with an attribute) enabling representation of
arguments in a flexible way. Selectional preferences are based
on PlWordNet structure as well.
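One way to render the pairs of semantic role and selectional preference as a data structure is sketched below; the role names, attribute values and lexical-unit identifiers are invented placeholders, not actual Walenty or PlWordNet entries:

```python
# Sketch of a Walenty-style semantic frame: a list of
# <semantic role, selectional preference> pairs linked to wordnet
# lexical units. All identifiers below are illustrative placeholders.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Argument:
    role: str                 # basic semantic role (first level)
    attribute: Optional[str]  # optional second-level attribute
    preference: list          # wordnet-based selectional preference

@dataclass
class Frame:
    lexical_units: list       # linked wordnet lexical-unit identifiers
    arguments: list           # the frame's <role, preference> pairs

frame = Frame(
    lexical_units=["give-1"],
    arguments=[
        Argument("Initiator", None, ["person-1"]),
        Argument("Theme", None, ["object-1"]),
        Argument("Recipient", "Foreground", ["person-1"]),
    ],
)
```

The two-level role representation (a basic role plus an optional attribute) is what gives the lexicon its flexibility in describing arguments.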
Italian VerbNet: A Construction-based Approach to Italian Verb Classification
Lucia Busso and Alessandro Lenci
This paper proposes a new method for Italian verb classification,
and a preliminary example of resulting classes, inspired by
Levin (1993) and VerbNet (Kipper-Schuler, 2005), yet partially
independent from these resources; we achieved such a result by
integrating Levin and VerbNet’s models of classification with
other theoretic frameworks and resources. The classification is
rooted in the constructionist framework (Goldberg, 1995; 2006)
and is distribution-based. It is also semantically characterized
by a link to FrameNet’s semantic frames to represent the event
expressed by a class. However, the new Italian classes maintain
the hierarchic “tree” structure and monotonic nature of VerbNet’s
classes, and, where possible, the original names (e.g.: Verbs of
Killing, Verbs of Putting, etc.). We therefore propose here a
taxonomy compatible with VerbNet but at the same time adapted
to Italian syntax and semantics. It also addresses a number of
problems intrinsic to the original classifications, such as the role
of argument alternations, here regarded simply as epiphenomena,
consistently with the constructionist approach.
A Large Rated Lexicon with French Medical Words
Natalia Grabar and Thierry Hamon
Patients are often exposed to medical terms, such as anosognosia,
myelodysplastic, or hepatojejunostomy, that can be semantically
complex and hardly understandable by non-experts in medicine.
Hence, it is important to assess which words are potentially non-
understandable and require further explanations. The purpose
of our work is to build a specific lexicon in which the words
are rated according to whether they are understandable or non-
understandable. We propose to work with medical words in
French such as provided by an international medical terminology.
The terms are segmented into single words and then each word
is manually processed by three annotators. The objective is
to assign each word to one of three categories: I can
understand, I am not sure, I cannot understand. The annotators
have no medical training, nor do they have specific medical
problems. They are supposed to represent an average patient.
The inter-annotator agreement is then computed. The content
of the categories is analyzed. Possible applications in which
this lexicon can be helpful are proposed and discussed. The
rated lexicon is freely available for research purposes. It
is accessible online at http://natalia.grabar.perso.
sfr.fr/rated-lexicon.html.
Best of Both Worlds: Making Word Sense Embeddings Interpretable
Alexander Panchenko
Word sense embeddings represent a word sense as a low-
dimensional numeric vector. While this representation is
potentially useful for NLP applications, its interpretability is
inherently limited. We propose a simple technique that improves
interpretability of sense vectors by mapping them to synsets
of a lexical resource. Our experiments with AdaGram sense
embeddings and BabelNet synsets show that it is possible to
retrieve synsets that correspond to automatically learned sense
vectors with a precision of 0.87, recall of 0.42 and AUC of 0.78.
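The abstract does not detail the mapping procedure; one much-simplified way to link sense vectors to synsets is to represent each synset by the average embedding of its member words and pick the nearest synset by cosine similarity. The vectors and synsets below are toy data, not AdaGram sense embeddings or BabelNet:

```python
# Simplified sense-to-synset linking: each synset is represented by the
# mean embedding of its member words; a sense vector is mapped to the
# synset with the highest cosine similarity. Toy 2-d data for illustration.
import numpy as np

def link_sense(sense_vec, synsets, word_vecs):
    """Return the id of the synset whose averaged word vector is closest."""
    def synset_vec(words):
        return np.mean([word_vecs[w] for w in words if w in word_vecs], axis=0)
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(synsets, key=lambda sid: cos(sense_vec, synset_vec(synsets[sid])))

word_vecs = {
    "river": np.array([1.0, 0.0]), "shore": np.array([0.9, 0.1]),
    "money": np.array([0.0, 1.0]), "loan":  np.array([0.1, 0.9]),
}
synsets = {"bank.n.geo": ["river", "shore"], "bank.n.fin": ["money", "loan"]}
print(link_sense(np.array([1.0, 0.2]), synsets, word_vecs))  # bank.n.geo
```

Once a sense vector is anchored to a synset, the synset's gloss and lemmas make the automatically learned sense human-interpretable.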
VerbLexPor: a lexical resource with semantic roles for Portuguese
Leonardo Zilio, Maria José Bocorny Finatto and Aline Villavicencio
This paper presents a lexical resource developed for Portuguese.
The resource contains sentences annotated with semantic roles.
The sentences were extracted from two domains: Cardiology
research papers and newspaper articles. Both corpora were
analyzed with the PALAVRAS parser and subsequently processed
with a subcategorization frames extractor, so that each sentence
that contained at least one main verb was stored in a database
together with its syntactic organization. The annotation was
manually carried out by a linguist using an annotation interface.
Both the annotated and non-annotated data were exported to an
XML format, which is readily available for download. The reason
behind exporting non-annotated data is that there is syntactic
information collected from the parser annotation in the non-
annotated data, and this could be useful for other researchers. The
sentences from both corpora were annotated separately, so that
it is possible to access sentences either from the Cardiology or
from the newspaper corpus. The full resource presents more than
seven thousand semantically annotated sentences, containing 192
different verbs and more than 15 thousand individual arguments
and adjuncts.
A Multilingual Predicate Matrix
Maddalen Lopez de Lacalle, Egoitz Laparra, Itziar Aldabe and German Rigau
This paper presents the Predicate Matrix 1.3, a lexical
resource resulting from the integration of multiple sources of
predicate information including FrameNet, VerbNet, PropBank
and WordNet. This new version of the Predicate Matrix has
been extended to cover nominal predicates by adding mappings
to NomBank. Similarly, we have integrated resources in Spanish,
Catalan and Basque. As a result, the Predicate Matrix 1.3 provides
a multilingual lexicon to allow interoperable semantic analysis in
multiple languages.
A Gold Standard for Scalar Adjectives
Bryan Wilkinson and Tim Oates
We present a gold standard for evaluating scale membership
and the order of scalar adjectives. In addition to evaluating
existing methods of ordering adjectives, this knowledge will
aid in studying the organization of adjectives in the lexicon.
This resource is the result of two elicitation tasks conducted
with informants from Amazon Mechanical Turk. The first
task is notable for gathering open-ended lexical data from
informants. The data is analyzed using Cultural Consensus
Theory, a framework from anthropology, to not only determine
scale membership but also the level of consensus among the
informants (Romney et al., 1986). The second task gathers a
culturally salient ordering of the words determined to be members.
We use this method to produce 12 scales of adjectives for use in
evaluation.
VerbCROcean: A Repository of Fine-Grained Semantic Verb Relations for Croatian
Ivan Sekulić and Jan Šnajder
In this paper we describe VerbCROcean, a broad-coverage
repository of fine-grained semantic relations between Croatian
verbs. Adopting the methodology of Chklovski and Pantel
(2004) used for acquiring the English VerbOcean, we first acquire
semantically related verb pairs from the hrWaC web corpus by
relying on distributional similarity of subject-verb-object paths in
the dependency trees. We then classify the semantic relations
between each pair of verbs as similarity, intensity, antonymy, or
happens-before, using a number of manually constructed
lexico-syntactic patterns. We evaluate the quality of the resulting resource
on a manually annotated sample of 1000 semantic verb relations.
The evaluation revealed that the predictions are most accurate for
the similarity relation, and least accurate for the intensity relation.
We make available two variants of VerbCROcean: a coverage-
oriented version, containing about 36k verb pairs at a precision of
41%, and a precision-oriented version containing about 5k verb
pairs, at a precision of 56%.
Enriching a Portuguese WordNet using Synonyms from a Monolingual Dictionary
Alberto Simões, Xavier Gómez Guinovart and José João Almeida
In this article we present an exploratory approach to enrich a
WordNet-like lexical ontology with the synonyms present in a
standard monolingual Portuguese dictionary. The dictionary was
converted from PDF into XML and senses were automatically
identified and annotated. This allowed us to extract them,
independently of definitions, and to create sets of synonyms
(synsets). These synsets were then aligned with WordNet
synsets, both in the same language (Portuguese) and projecting
the Portuguese terms into English, Spanish and Galician. This
process allowed both the addition of new term variants to existing
synsets and the creation of new synsets for Portuguese.
O25 - Sentiment Analysis
Thursday, May 26, 14:55
Chairperson: Frédérique Segond
Oral Session
Reliable Baselines for Sentiment Analysis in Resource-Limited Languages: The Serbian Movie Review Dataset
Vuk Batanović, Boško Nikolić and Milan Milosavljević
Collecting data for sentiment analysis in resource-limited
languages carries a significant risk of sample selection bias,
since the small quantities of available data are most likely not
representative of the whole population. Ignoring this bias leads to
less robust machine learning classifiers and less reliable evaluation
results. In this paper we present a dataset balancing algorithm
that minimizes the sample selection bias by eliminating irrelevant
systematic differences between the sentiment classes. We prove
its superiority over the random sampling method and we use
it to create the Serbian movie review dataset – SerbMR – the
first balanced and topically uniform sentiment analysis dataset in
Serbian. In addition, we propose an incremental way of finding
the optimal combination of simple text processing options and
machine learning features for sentiment classification. Several
popular classifiers are used in conjunction with this evaluation
approach in order to establish strong but reliable baselines for
sentiment analysis in Serbian.
ANTUSD: A Large Chinese Sentiment Dictionary
Shih-Ming Wang and Lun-Wei Ku
This paper introduces the augmented NTU sentiment dictionary,
abbreviated as ANTUSD, which was constructed by collecting
sentiment statistics for words from several sentiment annotation projects.
A total of 26,021 words were collected in ANTUSD. For each
word, the CopeOpi numerical sentiment score and the numbers of
positive, neutral, negative, non-opinionated, and not-a-word
annotations are provided.
Words and their sentiment information in ANTUSD have been
linked to the Chinese ontology E-HowNet to provide rich
semantic information. We demonstrate the usage of ANTUSD
in polarity classification of words, and the results show that a
superior F-score of 98.21 is achieved, which supports the usefulness
of ANTUSD. ANTUSD can be freely obtained through
application from NLPSA lab, Academia Sinica: http://
academiasinicanlplab.github.io/
Aspect based Sentiment Analysis in Hindi: Resource Creation and Evaluation
Md Shad Akhtar, Asif Ekbal and Pushpak Bhattacharyya
Due to the phenomenal growth of online product reviews,
sentiment analysis (SA) has gained huge attention, for example,
by online service providers. A number of benchmark datasets for
a wide range of domains have been made available for sentiment
analysis, especially in resource-rich languages. In this paper we
assess the challenges of SA in Hindi by providing a benchmark
setup, where we create an annotated dataset of high quality, build
machine learning models for sentiment analysis in order to show
the effective usage of the dataset, and finally make the resource
available to the community for further advancement of research.
The dataset comprises Hindi product reviews crawled from
various online sources. Each sentence of the review is annotated
with aspect term and its associated sentiment. As classification
algorithms we use Conditional Random Field (CRF) and Support
Vector Machine (SVM) for aspect term extraction and sentiment
analysis, respectively. Evaluation results show an average
F-measure of 41.07% for aspect term extraction and an accuracy of
54.05% for sentiment classification.
Gulf Arabic Linguistic Resource Building for Sentiment Analysis
Wafia Adouane and Richard Johansson
This paper deals with building linguistic resources for Gulf
Arabic, one of the Arabic varieties, for the sentiment analysis task
using machine learning. To our knowledge, no previous work
has been done on Gulf Arabic sentiment analysis despite the fact
that it is present in different online platforms. Hence, the first
challenge is the absence of annotated data and sentiment lexicons.
To fill this gap, we created these two main linguistic resources.
Then we conducted different experiments: using a Naive Bayes
classifier without any lexicon; adding a sentiment lexicon designed
primarily for MSA; using only the compiled Gulf Arabic sentiment
lexicon; and finally using both MSA and Gulf Arabic sentiment
lexicons. The Gulf Arabic lexicon gives a good improvement
in the classifier accuracy (90.54%) over a baseline that does
not use the lexicon (82.81%), while the MSA lexicon causes
the accuracy to drop to 76.83%. Moreover, mixing MSA and
Gulf Arabic lexicons causes the accuracy to drop to 84.94%
compared to using only the Gulf Arabic lexicon. This indicates that
it is useless to use MSA resources to deal with Gulf Arabic due
to the considerable differences and conflicting structures between
these two languages.
Using Data Mining Techniques for Sentiment Shifter Identification
Samira Noferesti and Mehrnoush Shamsfard
Sentiment shifters, i.e., words and expressions that can affect text
polarity, play an important role in opinion mining. However, the
limited ability of current automated opinion mining systems to
handle shifters represents a major challenge. The majority of
existing approaches rely on a manual list of shifters; few attempts
have been made to automatically identify shifters in text. Most of
them just focus on negating shifters. This paper presents a novel
and efficient semi-automatic method for identifying sentiment
shifters in drug reviews, aiming at improving the overall accuracy
of opinion mining systems. To this end, we use weighted
association rule mining (WARM), a well-known data mining
technique, for finding frequent dependency patterns representing
sentiment shifters from a domain-specific corpus. These patterns
that include different kinds of shifter words such as shifter verbs
and quantifiers are able to handle both local and long-distance
shifters. We also combine these patterns with a lexicon-based
approach for the polarity classification task. Experiments on
drug reviews demonstrate that extracted shifters can improve the
precision of the lexicon-based approach to polarity classification
by 9.25 percent.
O26 - Discourse and Dialogue
Thursday, May 26, 14:55
Chairperson: Mark Liberman
Oral Session
Discourse Structure and Dialogue Acts in Multiparty Dialogue: the STAC Corpus
Nicholas Asher, Julie Hunter, Mathieu Morey, Farah Benamara and Stergos Afantenos
This paper describes the STAC resource, a corpus of multi-party
chats annotated for discourse structure in the style of SDRT (Asher
and Lascarides, 2003; Lascarides and Asher, 2009). The main
goal of the STAC project is to study the discourse structure
of multi-party dialogues in order to understand the linguistic
strategies adopted by interlocutors to achieve their conversational
goals, especially when these goals are opposed. The STAC corpus
is not only a rich source of data on strategic conversation, but also
the first corpus that we are aware of that provides full discourse
structures for multi-party dialogues. It has other remarkable
features that make it an interesting resource for other topics:
interleaved threads, creative language, and interactions between
linguistic and extra-linguistic contexts.
Purely Corpus-based Automatic ConversationAuthoring
Guillaume Dubuisson Duplessis, Vincent Letard, Anne-Laure Ligozat and Sophie Rosset
This paper presents an automatic corpus-based process to author
an open-domain conversational strategy usable both in chatterbot
systems and as a fallback strategy for out-of-domain human
utterances. Our approach is implemented on a corpus of television
drama subtitles. This system is used as a chatterbot system to
collect a corpus of 41 open-domain textual dialogues with 27
human participants. The general capabilities of the system are
studied through objective measures and subjective self-reports in
terms of understandability, repetition and coherence of the system
responses selected in reaction to human utterances. Subjective
evaluations of the collected dialogues are presented with respect
to amusement, engagement and enjoyability. The main factors
influencing those dimensions in our chatterbot experiment are
discussed.
Dialogue System Characterisation by Back-channelling Patterns Extracted from Dialogue Corpus
Masashi Inoue and Hiroshi Ueno
In this study, we describe the use of back-channelling patterns
extracted from a dialogue corpus as a means of characterising text-
based dialogue systems. Our goal was to provide system users
with the feeling that they are interacting with distinct individuals
rather than artificially created characters. An analysis of the
corpus revealed that substantial differences exist among speakers
regarding the usage patterns of back-channelling. The patterns
consist of back-channelling frequency, types, and expressions.
They were used for system characterisation. Implemented system
characters were tested by asking users of the dialogue system to
identify the source speakers in the corpus. Experimental results
suggest the possibility of using back-channelling patterns alone
to characterise the dialogue system, in some cases even within the
same age and gender groups.
Towards Automatic Identification of Effective Clues for Team Word-Guessing Games
Eli Pincus and David Traum
Team word-guessing games where one player, the clue-giver,
gives clues attempting to elicit a target-word from another player,
the receiver, are a popular form of entertainment and also used for
educational purposes. Creating an engaging computational agent
capable of emulating a talented human clue-giver in a timed word-
guessing game depends on the ability to provide effective clues
(clues able to elicit a correct guess from a human receiver). There
are many available web resources and databases that can be mined
for the raw material for clues for target-words; however, a large
number of those clues are unlikely to be able to elicit a correct
guess from a human guesser. In this paper, we propose a method
for automatically filtering a clue corpus for effective clues for an
arbitrary target-word from a larger set of potential clues, using
machine learning on a set of features of the clues, including point-
wise mutual information between a clue’s constituent words and
a clue’s target-word. The results of the experiments significantly
improve the average clue quality over previous approaches, and
bring quality rates in-line with measures of human clue quality
derived from a corpus of human-human interactions. The paper
also introduces the data used to develop this method: audio
recordings of people making guesses after having heard the clues
being spoken by a synthesized voice.
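The point-wise mutual information feature mentioned in this abstract can be computed from simple co-occurrence counts. A minimal sketch follows; the counts, function name and maximum-likelihood estimation are illustrative assumptions, not details taken from the paper:

```python
import math

def pmi(pair_count, clue_word_count, target_count, total_pairs, total_words):
    """Point-wise mutual information between a clue's constituent word x
    and the clue's target word y: log( p(x, y) / (p(x) * p(y)) ).
    Probabilities are maximum-likelihood estimates from raw counts."""
    p_xy = pair_count / total_pairs
    p_x = clue_word_count / total_words
    p_y = target_count / total_words
    return math.log(p_xy / (p_x * p_y))

# Toy counts: a clue word co-occurs with the target 8 times out of
# 1000 observed (word, target) pairs; positive PMI suggests a strong
# association and hence a potentially effective clue.
score = pmi(pair_count=8, clue_word_count=20, target_count=50,
            total_pairs=1000, total_words=1000)
```

A classifier would then use such scores, alongside other clue features, to filter out clues unlikely to elicit a correct guess.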
Automatic Construction of Discourse Corpora for Dialogue Translation
Longyue Wang, Xiaojun Zhang, Zhaopeng Tu, Andy Way and Qun Liu
In this paper, a novel approach is proposed to automatically
construct a parallel discourse corpus for dialogue machine
translation. Firstly, parallel subtitle data and the corresponding
monolingual movie script data are crawled and collected from
the Internet. Then tags such as speaker and discourse boundary from
the script data are projected to its subtitle data via an information
retrieval approach in order to map monolingual discourse to
bilingual texts. We not only evaluate the mapping results, but also
integrate speaker information into the translation. Experiments
show our proposed method can achieve 81.79% and 98.64%
accuracy on speaker and dialogue boundary annotation, and
speaker-based language model adaptation can obtain around 0.5
BLEU points of improvement in translation quality. Finally, we
publicly release around 100K parallel discourse data with manual
speaker and dialogue boundary annotation.
O27 - Machine Translation and Evaluation (2)
Thursday, May 26, 14:55
Chairperson: Nizar Habash
Oral Session
Using Contextual Information for Machine Translation Evaluation
Marina Fomicheva and Núria Bel
Automatic evaluation of Machine Translation (MT) is typically
approached by measuring similarity between the candidate MT
and a human reference translation. An important limitation of
existing evaluation systems is that they are unable to distinguish
candidate-reference differences that arise due to acceptable
linguistic variation from the differences induced by MT errors.
In this paper we present a new metric, UPF-Cobalt, that addresses
this issue by taking into consideration the syntactic contexts of
candidate and reference words. The metric applies a penalty when
the words are similar but the contexts in which they occur are
not equivalent. In this way, Machine Translations (MTs) that are
different from the human translation but still essentially correct are
distinguished from those that share a high number of words with the
reference but alter the meaning of the sentence due to translation
errors. The results show that the method proposed is indeed
beneficial for automatic MT evaluation. We report experiments
based on two different evaluation tasks with various types of
manual quality assessment. The metric significantly outperforms
state-of-the-art evaluation systems in varying evaluation settings.
Bootstrapping a Hybrid MT System to a New Language Pair
João António Rodrigues, Nuno Rendeiro, Andreia Querido, Sanja Štajner and António Branco
The usual concern when opting for a rule-based or a hybrid
machine translation (MT) system is how much effort is required to
adapt the system to a different language pair or a new domain. In
this paper, we describe a way of adapting an existing hybrid MT
system to a new language pair, and show that such a system can
outperform a standard phrase-based statistical machine translation
system with an average of 10 person-months of work. This is
specifically important in the case of domain-specific MT for which
there is not enough parallel data for training a statistical machine
translation system.
Filtering Wiktionary Triangles by Linear Mapping between Distributed Word Models
Márton Makrai
Word translations arise in dictionary-like organization as well as
via machine learning from corpora. The former is exemplified
by Wiktionary, a crowd-sourced dictionary with editions in many
languages. Ács et al. (2013) obtain word translations from
Wiktionary with the pivot-based method, also called triangulation,
that infers word translations in a pair of languages based on
translations to other, typically better resourced ones called pivots.
Triangulation may introduce noise if words in the pivot are
polysemous. The reliability of each triangulated translation is
basically estimated by the number of pivot languages (Tanaka et
al., 1994). Mikolov et al. (2013) introduce a method for generating
or scoring word translations. Translation is formalized as a linear
mapping between distributed vector space models (VSM) of the
two languages. VSMs are trained on monolingual data, while
the mapping is learned in a supervised fashion, using a seed
dictionary of some thousand word pairs. The mapping can be
used to associate existing translations with a real-valued similarity
score. This paper exploits human labor in Wiktionary combined
with distributional information in VSMs. We train VSMs on
gigaword corpora, and the linear translation mapping on direct
(non-triangulated) Wiktionary pairs. This mapping is used to
filter triangulated translations based on scores. The motivation
is that scores from the mapping may be a smoother measure of
merit than considering only the number of pivots for the triangle.
We evaluate the scores against dictionaries extracted from parallel
corpora (Tiedemann, 2012). We show that the linear mapping indeed
provides a more reliable method for triangle scoring than the pivot
count. The methods we use are language-independent, and the
training data is easy to obtain for many languages. We chose
the German-Hungarian pair for evaluation, for which the filtered
triangles resulting from our experiments form the largest freely
available list of word translations we are aware of.
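The Mikolov-style scoring step this abstract builds on can be sketched with plain least squares. The toy dimensions and the synthetic "seed dictionary" below are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two monolingual vector space models:
# rows of X are source-language vectors of seed-dictionary words,
# rows of Z are the vectors of their target-language translations.
X = rng.normal(size=(200, 16))
W_true = rng.normal(size=(16, 16))   # hidden mapping used to fabricate Z
Z = X @ W_true

# Learn the linear translation mapping W minimising ||X W - Z||^2.
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

def translation_score(x_src, z_tgt, W):
    """Cosine similarity between the mapped source vector and a candidate
    target vector; higher scores suggest a more reliable translation pair."""
    v = x_src @ W
    return float(v @ z_tgt / (np.linalg.norm(v) * np.linalg.norm(z_tgt)))
```

Triangulated Wiktionary pairs can then be kept or discarded by thresholding `translation_score` rather than counting pivot languages.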
Translation Errors and Incomprehensibility: a Case Study using Machine-Translated Second Language Proficiency Tests
Takuya Matsuzaki, Akira Fujita, Naoya Todo and Noriko H. Arai
This paper reports on an experiment in which 795 human
participants answered questions taken from second language
proficiency tests that were translated to their native language. The
output of three machine translation systems and two different
human translations were used as the test material. We classified
the translation errors in the questions according to an error
taxonomy and analyzed the participants’ response on the basis
of the type and frequency of the translation errors. Through the
analysis, we identified several types of errors that most strongly
deteriorated the accuracy of the participants’ answers, their
confidence in the answers, and their overall evaluation of the
translation quality.
Word Sense-Aware Machine Translation: Including Senses as Contextual Features for Improved Translation Models
Steven Neale, Luís Gomes, Eneko Agirre, Oier Lopez de Lacalle and António Branco
Although it is commonly assumed that word sense disambiguation
(WSD) should help to improve lexical choice and improve the
quality of machine translation systems, how to successfully
integrate word senses into such systems remains an unanswered
question. Some successful approaches have involved
reformulating either WSD or the word senses it produces, but
work on using traditional word senses to improve machine
translation has met with limited success. In this paper,
we build upon previous work that experimented with including
word senses as contextual features in maxent-based translation
models. Training on a large, open-domain corpus (Europarl), we
demonstrate that this approach yields significant improvements in
machine translation from English to Portuguese.
O28 - Corpus Querying and Crawling
Thursday, May 26, 14:55
Chairperson: Tomaž Erjavec
Oral Session
SuperCAT: The (New and Improved) Corpus Analysis Toolkit
K. Bretonnel Cohen, William A. Baumgartner Jr. and Irina Temnikova
This paper reports SuperCAT, a corpus analysis toolkit. It is a
radical extension of SubCAT, the Sublanguage Corpus Analysis
Toolkit, from sublanguage analysis to corpus analysis in general.
The idea behind SuperCAT is that representative corpora have
no tendency towards closure—that is, they tend towards infinity.
In contrast, non-representative corpora have a tendency towards
closure—roughly, finiteness. SuperCAT focuses on general
techniques for the quantitative description of the characteristics of
any corpus (or other language sample), particularly concerning the
characteristics of lexical distributions. Additionally, SuperCAT
features a complete re-engineering of the previous SubCAT
architecture.
LanguageCrawl: A Generic Tool for Building Language Models Upon Common-Crawl
Szymon Roziewski and Wojciech Stokowiec
The web contains an immense amount of data: hundreds of
billions of words are waiting to be extracted and used for
language research. In this work we introduce our tool
LanguageCrawl, which allows NLP researchers to easily construct
a web-scale corpus from the Common Crawl Archive: a petabyte-
scale, open repository of web crawl information. Three use-
cases are presented: filtering Polish websites, building N-gram
corpora, and training a continuous skip-gram language model with
hierarchical softmax. Each of them has been implemented within
the LanguageCrawl toolkit, with the possibility of adjusting the
language and N-gram ranks. Special effort has been put into high
computing efficiency, by applying highly concurrent multitasking.
We make our tool publicly available to enrich NLP resources.
We strongly believe that our work will help to facilitate NLP
research, especially in under-resourced languages, where the lack
of appropriately sized corpora is a serious hindrance to applying
data-intensive methods, such as deep neural networks.
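The N-gram corpus-building step can be illustrated in a few lines of standard-library Python. This is a generic sketch of n-gram counting, not LanguageCrawl's actual implementation:

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield the n-grams of a token sequence as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

def ngram_counts(tokens, n):
    """Count n-gram frequencies: the core operation behind
    building an N-gram corpus from crawled text."""
    return Counter(ngrams(tokens, n))

tokens = "to be or not to be".split()
bigrams = ngram_counts(tokens, 2)   # ("to", "be") occurs twice
```

At web scale the same counting would be sharded across documents and merged, which is where the toolkit's concurrent multitasking comes in.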
Features for Generic Corpus Querying
Thomas Eckart, Christoph Kuras and Uwe Quasthoff
The availability of large corpora for more and more languages
calls for generic querying and standard interfaces. This
development is especially relevant in the context of integrated
research environments like CLARIN or DARIAH. The paper
focuses on several applications and implementation details on
the basis of a unified corpus format, a unique POS tag set,
and prepared data for word similarities. All described data or
applications are already or will be in the near future accessible
via well-documented RESTful Web services. The target group is
all kinds of interested persons with varying levels of experience in
programming or corpus query languages.
European Union Language Resources in Sketch Engine
Vít Baisa, Jan Michelfeit, Marek Medved’ and Milos Jakubicek
Several parallel corpora built from European Union language
resources are presented here. They were processed by state-of-the-
art tools and made available for researchers in the corpus manager
Sketch Engine. A completely new resource is introduced: the EUR-
Lex Corpus, one of the largest parallel corpora available at the
moment, containing 840 million English tokens; its largest
language pair, English-French, has more than 25 million aligned
segments (paragraphs).
Corpus Query Lingua Franca (CQLF)
Piotr Banski, Elena Frick and Andreas Witt
The present paper describes Corpus Query Lingua Franca (ISO
CQLF), a specification designed at ISO Technical Committee
37 Subcommittee 4 “Language resource management” for the
purpose of facilitating the comparison of properties of corpus
query languages. We outline the motivation for this endeavour
and present its aims and its general architecture. CQLF is intended
as a multi-part specification; here, we concentrate on the basic
metamodel that provides a frame that the other parts fit in.
P35 - Grammar and Syntax
Thursday, May 26, 16:55
Chairperson: Maria Simi
Poster Session
A sense-based lexicon of count and mass expressions: The Bochum English Countability Lexicon
Tibor Kiss, Francis Jeffry Pelletier, Halima Husic, Roman Nino Simunic and Johanna Marie Poppek
The present paper describes the current release of the Bochum
English Countability Lexicon (BECL 2.1), a large empirical
database consisting of lemmata from Open ANC
(http://www.anc.org) with added senses from WordNet (Fellbaum
1998). BECL 2.1 contains 11,800 annotated noun-sense
pairs, divided into four major countability classes and 18 fine-
grained subclasses. In the current version, BECL also provides
information on nouns whose senses occur in more than one
class, allowing a closer look at polysemy and homonymy with
regard to countability. Further included are sets of similar
senses using the Leacock and Chodorow (LCH) score for
semantic similarity (Leacock & Chodorow 1998), information
on orthographic variation, on the completeness of all WordNet
senses in the database and an annotated representation of different
types of proper names. The further development of BECL will
investigate the different countability classes of proper names and
the general relation between semantic similarity and countability
as well as recurring syntactic patterns for noun-sense pairs. The
BECL 2.1 database is also publicly available via
http://count-and-mass.org.
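The Leacock and Chodorow score used above to group similar senses is a simple function of taxonomy path length and depth. A sketch of the formula follows; note that implementations differ in whether the path length counts edges or nodes, so the convention here (a path length of at least 1) is an assumption:

```python
import math

def lch_similarity(path_length, taxonomy_depth):
    """Leacock-Chodorow semantic similarity: -log(len / (2 * D)), where
    len is the shortest path between two senses in the taxonomy
    (assumed here to be at least 1) and D is the maximum taxonomy depth."""
    return -math.log(path_length / (2.0 * taxonomy_depth))

# Senses one step apart in a taxonomy of depth 10 score higher than
# senses five steps apart.
close = lch_similarity(1, 10)
far = lch_similarity(5, 10)
```

Thresholding such scores over WordNet sense pairs yields the "sets of similar senses" the resource provides.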
Detecting Optional Arguments of Verbs
Andras Kornai, Dávid Márk Nemeskey and Gábor Recski
We propose a novel method for detecting optional arguments of
Hungarian verbs using only positive data. We introduce a custom
variant of collexeme analysis that explicitly models the noise in
verb frames. Our method is, for the most part, unsupervised:
we use the spectral clustering algorithm described in Brew and
Schulte in Walde (2002) to build a noise model from a short,
manually verified seed list of verbs. We experimented with
both raw count- and context-based clusterings and found their
performance almost identical. The code for our algorithm and the
frame list are freely available at
http://hlt.bme.hu/en/resources/tade.
Leveraging Native Data to Correct Preposition Errors in Learners’ Dutch
Lennart Kloppenburg and Malvina Nissim
We address the task of automatically correcting preposition errors
in learners’ Dutch by modelling preposition usage in native
language. Specifically, we build two models exploiting a large
corpus of Dutch. The first is a binary model for detecting whether
a preposition should be used at all in a given position or not.
The second is a multiclass model for selecting the appropriate
preposition in case one should be used. The models are tested
on native as well as learners’ data. For the latter we exploit a
crowdsourcing strategy to elicit native judgements. On native test
data the models perform very well, showing that we can model
preposition usage appropriately. However, the evaluation on
learners’ data shows that while detecting that a given preposition
is wrong can be done reasonably well, detecting the absence of a
preposition is a lot more difficult. Observing such results and
the data we deal with, we envisage various ways of improving
performance, and report them in the final section of this article.
D(H)ante: A New Set of Tools for XIII Century Italian
Angelo Basile and Federico Sangati
In this paper we describe 1) the process of converting a corpus of
Dante Alighieri from a TEI XML format into a pseudo-CoNLL
format; 2) how a pos-tagger trained on modern Italian performs on
Dante’s Italian; and 3) the performance of two different pos-taggers
trained on the given corpus. We are making our conversion
scripts and models available to the community. The two other
models trained on the corpus perform reasonably well. The tool
used for the conversion process might prove useful for bridging
the gap between traditional digital humanities and modern NLP
applications since the TEI original format is not usually suitable
for being processed with standard NLP tools. We believe our work
will serve both communities: the DH community will be able to
tag new documents and the NLP world will have an easier way
of converting existing documents to a standardized machine-readable
format.
Multilevel Annotation of Agreement and Disagreement in Italian News Blogs
Fabio Celli, Giuseppe Riccardi and Firoj Alam
In this paper, we present a corpus of news blog conversations
in Italian annotated with gold standard agreement/disagreement
relations at message and sentence levels. This is the first resource
of this kind in Italian. The analysis of ADRs at the two
levels revealed that agreement annotated at message level is
consistent and generally reflected at sentence level; moreover, the
argumentation structure of disagreement is more complex than that
of agreement. The manual error analysis revealed that this resource
is useful not only for the analysis of argumentation, but also for
the detection of irony/sarcasm in online debates. The corpus and
annotation tool are available for research purposes on request.
Sentence Similarity based on Dependency Tree Kernels for Multi-document Summarization
Saziye Betül Özates, Arzucan Özgür and Dragomir Radev
We introduce an approach based on using the dependency
grammar representations of sentences to compute sentence
similarity for extractive multi-document summarization. We
adapt and investigate the effects of two untyped dependency
tree kernels, which have originally been proposed for relation
extraction, to the multi-document summarization problem. In
addition, we propose a series of novel dependency grammar based
kernels to better represent the syntactic and semantic similarities
among the sentences. The proposed methods incorporate the type
information of the dependency relations for sentence similarity
calculation. To our knowledge, this is the first study that
investigates using dependency tree based sentence similarity for
multi-document summarization.
Discontinuous Verb Phrases in Parsing and Machine Translation of English and German
Sharid Loáiciga and Kristina Gulordava
In this paper, we focus on the verb-particle (V-Prt) split
construction in English and German and its difficulty for parsing
and Machine Translation (MT). For German, we use an existing
test suite of V-Prt split constructions, while for English, we
build a new and comparable test suite from raw data. These
two data sets are then used to perform an analysis of errors in
dependency parsing, word-level alignment and MT, which arise
from the discontinuous order in V-Prt split constructions. In the
automatic alignments of parallel corpora, most of the particles
align to NULL. These mis-alignments and the inability of phrase-
based MT systems to recover discontinuous phrases result in low
quality translations of V-Prt split constructions both in English and
German. However, our results show that the V-Prt split phrases
are correctly parsed in 90% of cases, suggesting that syntactic-
based MT should perform better on these constructions. We
evaluate a syntactic-based MT system on German and compare
its performance to the phrase-based system.
A Lexical Resource for the Identification of “Weak Words” in German Specification Documents
Jennifer Krisch, Melanie Dick, Ronny Jauch and Ulrich Heid
We report on the creation of a lexical resource for the identification
of potentially unspecific or imprecise constructions in German
requirements documentation from the car manufacturing industry.
In requirements engineering, such expressions are called “weak
words”: they are not sufficiently precise to ensure an unambiguous
interpretation by the contractual partners, who for the definition
of their cooperation, typically rely on specification documents
(Melchisedech, 2000); an example are dimension adjectives, such
as kurz or lang (‘short’, ‘long’) which need to be modified
by adverbials indicating the exact duration, size etc. Contrary
to standard practice in requirements engineering, where the
identification of such weak words is merely based on stopword
lists, we identify weak uses in context, by querying annotated text.
The queries are part of the resource, as they define the conditions
when a word use is weak. We evaluate the recognition of weak
uses on our development corpus and on an unseen evaluation
corpus, reaching stable F1-scores above 0.95.
Recent Advances in Development of a Lexicon-Grammar of Polish: PolNet 3.0
Zygmunt Vetulani, Grazyna Vetulani and Bartłomiej Kochanowski
The granularity of PolNet (Polish Wordnet) is the main theoretical
issue discussed in the paper. We describe the latest extension of
PolNet including valency information of simple verbs and noun-
verb collocations using manual and machine-assisted methods.
Valency is defined to include both semantic and syntactic
selectional restrictions. We assume the valency structure of a
verb to be an index of meaning. Consequently, we consider it an
attribute of a synset. Strict application of this principle results in
fine granularity of the verb section of the wordnet. Considering
valency as a distinctive feature of synsets was an essential step to
transform the initial PolNet (first intended as a lexical ontology)
into a lexicon-grammar. For the present refinement of PolNet we
assume that the category of language register is a part of meaning.
The totality of PolNet 2.0 synsets is being revised in order to split
the PolNet 2.0 synsets that contain different register words into
register-uniform sub-synsets. We completed this operation for
synsets that were used as values of semantic roles. The operation
augmented the number of considered synsets by 29%. In the
paper we report an extension of the class of collocation-based verb
synsets.
C-WEP – Rich Annotated Collection of Writing Errors by Professionals
Cerstin Mahlow
This paper presents C-WEP, the Collection of Writing Errors
by Professional Writers of German. It currently consists of
245 sentences with grammatical errors. All sentences are taken
from published texts. All authors are professional writers with
high skill levels with respect to German, the genres, and the
topics. The purpose of this collection is to provide seeds for more
sophisticated writing support tools, as only a very small proportion
of those errors can be detected by state-of-the-art checkers. C-
WEP is annotated on various levels and freely available.
Improving corpus search via parsing
Natalia Klyueva and Pavel Stranák
In this paper, we describe an addition to the corpus query
system Kontext that makes it possible to search using syntactic
attributes in addition to the existing features, mainly lemmas and
morphological categories. We present the enhancements of the
corpus query system itself, the attributes we use to represent
syntactic structures in data, and some examples of querying the
syntactically annotated corpora, such as treebanks in various
languages as well as an automatically parsed large corpus.
P36 - Sentiment Analysis and Opinion Mining (2)
Thursday, May 26, 16:55
Chairperson: Manfred Stede
Poster Session
Affective Lexicon Creation for the Greek Language
Elisavet Palogiannidi, Polychronis Koutsakis, Elias Iosif and Alexandros Potamianos
Starting from the English affective lexicon ANEW (Bradley and
Lang, 1999a) we have created the first Greek affective lexicon.
It contains human ratings for the three continuous affective
dimensions of valence, arousal and dominance for 1034 words.
The Greek affective lexicon is compared with affective lexica in
English, Spanish and Portuguese. The lexicon is automatically
expanded by selecting a small number of manually annotated
words to bootstrap the process of estimating affective ratings
of unknown words. We experimented with the parameters of
the semantic-affective model in order to investigate their impact
on its performance, which reaches 85% binary classification
accuracy (positive vs. negative ratings). We share the Greek
affective lexicon that consists of 1034 words and the automatically
expanded Greek affective lexicon that contains 407K words.
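The expansion step, estimating affective ratings for unknown words from a handful of manually annotated seeds, can be caricatured as a similarity-weighted combination of seed ratings. This is a deliberately simplified stand-in for the paper's semantic-affective model, and the similarity values and seed ratings below are invented:

```python
def estimate_rating(similarities, seed_ratings):
    """Estimate an affective rating (e.g. valence) for an unknown word as a
    similarity-weighted average of the ratings of manually annotated seed
    words. Simplified sketch; the model in the paper is trained, not fixed."""
    weight = sum(similarities)
    if weight == 0:
        return 0.0
    return sum(s * r for s, r in zip(similarities, seed_ratings)) / weight

# An unknown word that resembles a positive seed (similarity 0.9) far
# more than a negative one (similarity 0.1) gets a positive valence.
valence = estimate_rating([0.9, 0.1], [0.8, -0.6])
```

Applying such an estimator over a large vocabulary is what allows a 1034-word manual lexicon to be expanded to hundreds of thousands of entries.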
A Hungarian Sentiment Corpus Manually Annotated at Aspect Level
Martina Katalin Szabó, Veronika Vincze, Katalin Ilona Simkó, Viktor Varga and Viktor Hangya
In this paper we present a Hungarian sentiment corpus manually
annotated at aspect level. Our corpus consists of Hungarian
opinion texts written about different types of products. The main
aim of creating the corpus was to produce an appropriate database
providing possibilities for developing text mining software tools.
The corpus is a unique Hungarian database: to the best of
our knowledge, no digitized Hungarian sentiment corpus that is
annotated on the level of fragments and targets has been created so
far. In addition, many language elements of the corpus, relevant
from the point of view of sentiment analysis, received distinct types
of tags in the annotation. In this paper, on the one hand, we
present the method of annotation, and we discuss the difficulties
concerning the text annotation process. On the other hand, we provide
some quantitative and qualitative data on the corpus. We conclude
with a description of the applicability of the corpus.
Effect Functors for Opinion Inference
Josef Ruppenhofer and Jasper Brandes
Sentiment analysis has so far focused on the detection of explicit
opinions. However, implicit opinions have lately received broader
attention, the key idea being that the evaluation of an event type by
a speaker depends on how the participants in the event are valued
and how the event itself affects the participants. We present an
annotation scheme for adding relevant information, couched in
terms of so-called effect functors, to German lexical items. Our
scheme synthesizes and extends previous proposals. We report
on an inter-annotator agreement study. We also present results
of a crowdsourcing experiment to test the utility of some known
and some new functors for opinion inference where, unlike in
previous work, subjects are asked to reason from event evaluation
to participant evaluation.
Sentiframes: A Resource for Verb-centered German Sentiment Inference
Manfred Klenner and Michael Amsler
In this paper, a German verb resource for verb-centered sentiment
inference is introduced and evaluated. Our model specifies verb
polarity frames that capture the polarity effects on the fillers of
the verb’s arguments given a sentence with that verb frame. Verb
signatures and selectional restrictions are also part of the model.
An algorithm to apply the verb resource to treebank sentences and
the results of our first evaluation are discussed.
Annotating Sentiment and Irony in the Online Italian Political Debate on #labuonascuola
Marco Stranisci, Cristina Bosco, Delia Irazú Hernández Farías and Viviana Patti
In this paper we present the TWitterBuonaScuola corpus (TW-
BS), a novel Italian linguistic resource for Sentiment Analysis,
developed with the main aim of analyzing the online debate
on the controversial Italian political reform “Buona Scuola”
(Good school), aimed at reorganizing the national educational
and training systems. We describe the methodologies applied
in the collection and annotation of data. The collection has
been driven by the detection of the hashtags mainly used by
the participants in the debate, while the annotation has been
focused on sentiment polarity and irony, but also extended to
mark the aspects of the reform that were mainly discussed in the
debate. An in-depth study of the disagreement among annotators
is included. We describe the collection and annotation stages, and
the in-depth analysis of disagreement made with Crowdflower, a
crowdsourcing annotation platform.
NileULex: A Phrase and Word Level Sentiment Lexicon for Egyptian and Modern Standard Arabic
Samhaa R. El-Beltagy
This paper presents NileULex, an Arabic sentiment
lexicon containing close to six thousand Arabic words and
compound phrases. Forty-five percent of the terms and expressions
in the lexicon are Egyptian or colloquial, while fifty-five percent
are Modern Standard Arabic. While the collection of many of
the terms included in the lexicon was done automatically, the
actual addition of any term was done manually. One of the
important criteria for adding terms to the lexicon was that they
be as unambiguous as possible. The result is a lexicon with a
much higher quality than any translated variant or automatically
constructed one. To demonstrate that a lexicon such as this
can directly impact the task of sentiment analysis, a very basic
machine learning based sentiment analyser that uses unigrams,
bigrams, and lexicon based features was applied on two different
Twitter datasets. The obtained results were compared to a baseline
system that only uses unigrams and bigrams. The same lexicon
based features were also generated using a publicly available
translation of a popular sentiment lexicon. The experiments show
that using the developed lexicon improves the results over both
the baseline and the publicly available lexicon.
OPFI: A Tool for Opinion Finding in Polish
Aleksander Wawer
The paper contains a description of OPFI: Opinion Finder for
the Polish Language, a freely available tool for opinion target
extraction. The goal of the tool is opinion finding: a task of
identifying tuples composed of a sentiment (positive or negative)
and its target (about whom or what the sentiment is expressed).
OPFI is not dependent on any particular method of sentiment
identification and provides a built-in sentiment dictionary as a
convenient option. Technically, it contains implementations of
three different modes of opinion tuple generation: one hybrid
based on dependency parsing and CRF, the second based on
shallow parsing and the third on deep learning, namely a GRU
neural network. The paper also contains a description of related
language resources: two annotated treebanks and one set of
tweets.
Rude waiter but mouthwatering pastries! An exploratory study into Dutch Aspect-Based Sentiment Analysis
Orphee De Clercq and Veronique Hoste
The fine-grained task of automatically detecting all sentiment
expressions within a given document and the aspects to which
they refer is known as aspect-based sentiment analysis. In this
paper we present the first full aspect-based sentiment analysis
pipeline for Dutch and apply it to customer reviews. To this
purpose, we collected reviews from two different domains, i.e.
restaurant and smartphone reviews. Both corpora have been
manually annotated using newly developed guidelines that comply
with standard practices in the field. For our experimental pipeline
we perceive aspect-based sentiment analysis as a task consisting
of three main subtasks which have to be tackled incrementally:
aspect term extraction, aspect category classification and polarity
classification. First experiments on our Dutch restaurant corpus
reveal that this is indeed a feasible approach that yields promising
results.
P37 - Parallel and Comparable Corpora
Thursday, May 26, 16:55
Chairperson: Jörg Tiedemann Poster Session
Building A Case-based Semantic English-Chinese Parallel Treebank
Huaxing Shi, Tiejun Zhao and Keh-Yih Su
We construct a case-based English-to-Chinese semantic
constituent parallel Treebank for a Statistical Machine Translation
(SMT) task by labelling each node of the Deep Syntactic Tree
(DST) with our refined semantic cases. Since subtree span-
crossing is harmful in tree-based SMT, DST is adopted to alleviate
this problem. At the same time, we tailor an existing case set to
represent bilingual shallow semantic relations more precisely.
This Treebank is a part of a semantic corpus building project,
which aims to build a semantic bilingual corpus annotated with
syntactic, semantic cases and word senses. Data in our Treebank
is from the news domain of Datum corpus. 4,000 sentence pairs
are selected to cover various lexicons and part-of-speech (POS)
n-gram patterns as much as possible. This paper presents the
construction of this case Treebank. Also, we have tested the
effect of adopting DST structure in alleviating subtree span-
crossing. Our preliminary analysis shows that the compatibility
between Chinese and English trees can be significantly increased
by transforming the parse-tree into the DST. Furthermore, the
human agreement rate in annotation is found to be acceptable
(90% in English nodes, 75% in Chinese nodes).
Uzbek-English and Turkish-English Morpheme Alignment Corpora
Xuansong Li, Jennifer Tracey, Stephen Grimes and Stephanie Strassel
Morphologically-rich languages pose problems for machine
translation (MT) systems, including word-alignment errors, data
sparsity and multiple affixes. Current alignment models at word-
level do not distinguish words and morphemes, thus yielding
low-quality alignment and subsequently affecting end translation
quality. Models using morpheme-level alignment can reduce the
vocabulary size of morphologically-rich languages and overcomes
data sparsity. The alignment data based on smallest units
reveals subtle language features and enhances translation quality.
Recent research has shown such morpheme-level alignment (MA)
data to be valuable linguistic resources for SMT, particularly for
languages with rich morphology. In support of this research
trend, the Linguistic Data Consortium (LDC) created Uzbek-
English and Turkish-English alignment data which are manually
aligned at the morpheme level. This paper describes the
creation of MA corpora, including alignment and tagging process
and approaches, highlighting annotation challenges and specific
features of languages with rich morphology. The light tagging
annotation on the alignment layer adds extra value to the MA
data, facilitating users in flexibly tailoring the data for various MT
model training.
Parallel Sentence Extraction from Comparable Corpora with Neural Network Features
Chenhui Chu, Raj Dabre and Sadao Kurohashi
Parallel corpora are crucial for machine translation (MT); however,
they are quite scarce for most language pairs and domains. As
comparable corpora are far more available, many studies have
been conducted to extract parallel sentences from them for MT. In
this paper, we exploit the neural network features acquired from
neural MT for parallel sentence extraction. We observe significant
improvements for both accuracy in sentence extraction and MT
performance.
TweetMT: A Parallel Microblog Corpus
Iñaki San Vicente, Iñaki Alegria, Cristina España-Bonet, Pablo Gamallo, Hugo Gonçalo Oliveira, Eva Martinez Garcia, Antonio Toral, Arkaitz Zubiaga and Nora Aranberri
We introduce TweetMT, a parallel corpus of tweets in four
language pairs that combine five languages (Spanish from/to
Basque, Catalan, Galician and Portuguese), all of which have
an official status in the Iberian Peninsula. The corpus has been
created by combining automatic collection and crowdsourcing
approaches, and it is publicly available. It is intended for
the development and testing of microtext machine translation
systems. In this paper we describe the methodology followed to
build the corpus, and present the results of the shared task in which
it was tested.
The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine
Mariana Neves, Antonio Jimeno Yepes and Aurélie Névéol
The biomedical scientific literature is a rich source of information
not only in the English language, for which it is more abundant,
but also in other languages, such as Portuguese, Spanish
and French. We present the first freely available parallel
corpus of scientific publications for the biomedical domain.
Documents from the “Biological Sciences” and “Health Sciences”
categories were retrieved from the Scielo database and parallel
titles and abstracts are available for the following language
pairs: Portuguese/English (about 86,000 documents in total),
Spanish/English (about 95,000 documents) and French/English
(about 2,000 documents). Additionally, monolingual data was
also collected for all four languages. Sentences in the parallel
corpus were automatically aligned and a manual analysis of 200
documents by native experts found that a minimum of 79%
of sentences were correctly aligned in all language pairs. We
demonstrate the utility of the corpus by running baseline machine
translation experiments. We show that for all language pairs,
a statistical machine translation system trained on the parallel
corpora achieves performance that rivals or exceeds the state of
the art in the biomedical domain. Furthermore, the corpora are
currently being used in the biomedical task in the First Conference
on Machine Translation (WMT’16).
Producing Monolingual and Parallel Web Corpora at the Same Time - SpiderLing and Bitextor’s Love Affair
Nikola Ljubešic, Miquel Esplà-Gomis, Antonio Toral, Sergio Ortiz Rojas and Filip Klubicka
This paper presents an approach for building large monolingual
corpora and, at the same time, extracting parallel data by crawling
the top-level domain of a given language of interest. For gathering
linguistically relevant data from top-level domains we use the
SpiderLing crawler, modified to crawl data written in multiple
languages. The output of this process is then fed to Bitextor, a tool
for harvesting parallel data from a collection of documents. We
call the system combining these two tools Spidextor, a blend of the
names of its two crucial parts. We evaluate the described approach
intrinsically by measuring the accuracy of the extracted bitexts
from the Croatian top-level domain “.hr” and the Slovene top-level
domain “.si”, and extrinsically on the English-Croatian language
pair by comparing an SMT system built from the crawled data
with third-party systems. We finally present parallel datasets
collected with our approach for the English-Croatian, English-
Finnish, English-Serbian and English-Slovene language pairs.
P38 - Social Media
Thursday, May 26, 16:55
Chairperson: Fei Xia Poster Session
Towards Using Social Media to Identify Individuals at Risk for Preventable Chronic Illness
Dane Bell, Daniel Fried, Luwen Huangfu, Mihai Surdeanu and Stephen Kobourov
We describe a strategy for the acquisition of training data
necessary to build a social-media-driven early detection system
for individuals at risk for (preventable) type 2 diabetes mellitus
(T2DM). The strategy uses a game-like quiz with data and
questions acquired semi-automatically from Twitter. The
questions are designed to inspire participant engagement and
collect relevant data to train a public-health model applied to
individuals. Prior systems designed to use social media such
as Twitter to predict obesity (a risk factor for T2DM) operate
on entire communities such as states, counties, or cities, based
on statistics gathered by government agencies. Because there
is considerable variation among individuals within these groups,
training data on the individual level would be more effective,
but this data is difficult to acquire. The approach proposed
here aims to address this issue. Our strategy has two steps.
First, we trained a random forest classifier on data gathered from
(public) Twitter statuses and state-level statistics with state-of-the-
art accuracy. We then converted this classifier into a 20-questions-
style quiz and made it available online. In doing so, we achieved
high engagement with individuals that took the quiz, while also
building a training set of voluntarily supplied individual-level data
for future classification.
Can Tweets Predict TV Ratings?
Bridget Sommerdijk, Eric Sanders and Antal van den Bosch
We set out to investigate whether TV ratings and mentions of TV
programmes on the Twitter social media platform are correlated.
If such a correlation exists, Twitter may be used as an alternative
source for estimating viewer popularity. Moreover, the Twitter-
based rating estimates may be generated during the programme,
or even before. We count the occurrences of programme-specific
hashtags in an archive of Dutch tweets of eleven popular TV
shows broadcast in the Netherlands in one season, and perform
correlation tests. Overall we find a strong correlation of 0.82; the
correlation remains strong, 0.79, if tweets are counted a half hour
before broadcast time. However, the two most popular TV shows
account for most of the positive effect; if we leave out the single
and second most popular TV shows, the correlation drops to being
moderate to weak. Also, within a TV show, correlations between
ratings and tweet counts are mostly weak, while correlations
between TV ratings of the previous and next shows are strong. In
the absence of information on previous shows, Twitter-based counts
may be a viable alternative to classic estimation methods for
TV ratings. Estimates are more reliable with more popular TV
shows.
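The correlation analysis described above reduces to computing a Pearson coefficient between per-show hashtag counts and ratings; a minimal stdlib sketch, with made-up numbers for illustration:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# hypothetical hashtag counts and ratings for five broadcasts
tweet_counts = [120, 340, 90, 410, 150]
ratings = [1.1, 2.9, 0.8, 3.5, 1.4]
r = pearson(tweet_counts, ratings)
```

The study's headline figures (0.82 overall, 0.79 half an hour before broadcast) are coefficients of exactly this kind, computed over show-level counts.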
Classifying Out-of-vocabulary Terms in a Domain-Specific Social Media Corpus
SoHyun Park, Afsaneh Fazly, Annie Lee, Brandon Seibel, Wenjie Zi and Paul Cook
In this paper we consider the problem of out-of-vocabulary term
classification in web forum text from the automotive domain. We
develop a set of nine domain- and application-specific categories
for out-of-vocabulary terms. We then propose a supervised
approach to classify out-of-vocabulary terms according to these
categories, drawing on features based on word embeddings, and
linguistic knowledge of common properties of out-of-vocabulary
terms. We show that the features based on word embeddings
are particularly informative for this task. The categories that
we predict could serve as a preliminary, automatically-generated
source of lexical knowledge about out-of-vocabulary terms.
Furthermore, we show that this approach can be adapted to give a
semi-automated method for identifying out-of-vocabulary terms
of a particular category, automotive named entities, that is of
particular interest to us.
Corpus for Customer Purchase Behavior Prediction in Social Media
Shigeyuki Sakaki, Francine Chen, Mandy Korpusik and Yan-Ying Chen
Many people post about their daily life on social media. These
posts may include information about the purchase activity of
people, and insights useful to companies can be derived from
them: e.g. profile information of a user who mentioned something
about their product. As a further advanced analysis, we consider
extracting users who are likely to buy a product from the set of
users who mentioned that the product is attractive. In this paper,
we report our methodology for building a corpus for Twitter user
purchase behavior prediction. First, we collected Twitter users
who posted a want phrase + product name: e.g. “want a Xperia” as
candidate want users, and also candidate bought users in the same
way. Then, we asked an annotator to judge whether a candidate
user actually bought a product. We also annotated whether tweets
randomly sampled from want/bought user timelines are relevant
or not to purchase. In this annotation, 58% of want user tweets
and 35% of bought user tweets were annotated as relevant. Our
data indicate that information embedded in timeline tweets can be
used to predict purchase behavior of tweeted products.
Segmenting Hashtags using Automatically Created Training Data
Arda Celebi and Arzucan Özgür
Hashtags, which are commonly composed of multiple words,
are increasingly used to convey the actual messages in tweets.
Understanding what tweets are saying is getting more dependent
on understanding hashtags. Therefore, identifying the individual
words that constitute a hashtag is an important yet challenging
task due to the abrupt nature of the language used in tweets.
In this study, we introduce a feature-rich approach based on
using supervised machine learning methods to segment hashtags.
Our approach is unsupervised in the sense that instead of
using manually segmented hashtags for training the machine
learning classifiers, we automatically create our training data
by using tweets as well as by automatically extracting hashtag
segmentations from a large corpus. We achieve promising results
with such automatically created noisy training data.
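As a rough illustration of the segmentation task itself (not the authors' feature-rich supervised method), a hashtag can be split by dynamic programming over a scored vocabulary; the toy vocabulary below is an assumption:

```python
def segment(hashtag, vocab, max_word_len=20):
    """Segment a hashtag into known words via dynamic programming,
    maximizing the summed score of the chosen vocabulary entries."""
    n = len(hashtag)
    best = [None] * (n + 1)  # best[i]: (score, words) for prefix hashtag[:i]
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = hashtag[j:i]
            if best[j] is not None and word in vocab:
                score = best[j][0] + vocab[word]
                if best[i] is None or score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    # fall back to the unsegmented hashtag if no full segmentation exists
    return best[n][1] if best[n] is not None else [hashtag]

# hypothetical word scores, e.g. log-frequencies from a large corpus
vocab = {"no": 1.0, "bel": 1.0, "nobel": 2.5, "prize": 2.0}
```

The scores standing in for `vocab` values would, in a setup like the one described, come from the automatically created training data rather than be hand-assigned.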
Exploring Language Variation Across Europe - A Web-based Tool for Computational Sociolinguistics
Dirk Hovy and Anders Johannsen
Language varies not only between countries, but also along
regional and socio-demographic lines. This variation is one of the
driving factors behind language change. However, investigating
language variation is a complex undertaking: the more factors we
want to consider, the more data we need. Traditional qualitative
methods are not well-suited to this, and are therefore restricted
to isolated factors. This reduction limits the potential insights,
and risks attributing undue importance to easily observed factors.
While there is a large interest in linguistics to increase the
quantitative aspect of such studies, it requires training in both
variational linguistics and computational methods, a combination
that is still not common. We take a first step here toward alleviating the
problem by providing an interface, www.languagevariation.com,
to explore large-scale language variation along multiple socio-
demographic factors – without programming knowledge. It makes
use of large amounts of data and provides statistical analyses,
maps, and interactive features that will enable scholars to explore
language variation in a data-driven way.
Predicting Author Age from Weibo Microblog Posts
Wanru Zhang, Andrew Caines, Dimitrios Alikaniotis and Paula Buttery
We report an author profiling study based on Chinese social media
texts gleaned from Sina Weibo in which we attempt to predict
the author’s age group based on various linguistic text features
mainly relating to non-standard orthography: classical Chinese
characters, hashtags, emoticons and kaomoji, homogeneous
punctuation and Latin character sequences, and poetic format.
We also tracked the use of selected popular Chinese expressions,
parts-of-speech and word types. We extracted 100 posts from
100 users in each of four age groups (under-18, 19-29, 30-39,
over-40 years) and by clustering users’ posts fifty at a time we
trained a maximum entropy classifier to predict author age group
to an accuracy of 65.5%. We show which features are associated
with younger and older age groups, and make our normalisation
resources available to other researchers.
Effects of Sampling on Twitter Trend Detection
Andrew Yates, Alek Kolcz, Nazli Goharian and Ophir Frieder
Much research has focused on detecting trends on Twitter,
including health-related trends such as mentions of Influenza-like
illnesses or their symptoms. The majority of this research has been
conducted using Twitter’s public feed, which includes only about
1% of all public tweets. It is unclear if, when, and how using
Twitter’s 1% feed has affected the evaluation of trend detection
methods. In this work we use a larger feed to investigate the
effects of sampling on Twitter trend detection. We focus on using
health-related trends to estimate the prevalence of Influenza-like
illnesses based on tweets. We use ground truth obtained from
the CDC and Google Flu Trends to explore how the prevalence
estimates degrade when moving from a 100% to a 1% sample.
We find that using the 1% sample is unlikely to substantially
harm ILI estimates made at the national level, but can cause poor
performance when estimates are made at the city level.
Automatic Classification of Tweets for Analyzing Communication Behavior of Museums
Nicolas Foucault and Antoine Courtin
In this paper, we present a study on tweet classification which
aims to define the communication behavior of the 103 French
museums that participated in 2014 in the Twitter operation:
MuseumWeek. The tweets were automatically classified in
four communication categories: sharing experience, promoting
participation, interacting with the community, and promoting-
informing about the institution. Our classification is multi-class.
It combines Support Vector Machines and Naive Bayes methods
and is supported by a selection of eighteen subtypes of features
of four different kinds: metadata information, punctuation marks,
tweet-specific and lexical features. It was tested against a corpus
of 1,095 tweets manually annotated by two experts in Natural
Language Processing and Information Communication and twelve
Community Managers of French museums. We obtained a
state-of-the-art F1-score of 72% by 10-fold cross-validation.
This result is very encouraging, since it is even better than some
state-of-the-art results found in the tweet classification literature.
P39 - Word Sense Disambiguation (2)
Thursday, May 26, 16:55
Chairperson: Elisabetta Jezek Poster Session
Graph-Based Induction of Word Senses in Croatian
Marko Bekavac and Jan Šnajder
Word sense induction (WSI) seeks to induce senses of words
from unannotated corpora. In this paper, we address the WSI
task for the Croatian language. We adopt the word clustering
approach based on co-occurrence graphs, in which senses are
taken to correspond to strongly inter-connected components of
co-occurring words. We experiment with a number of graph
construction techniques and clustering algorithms, and evaluate
the sense inventories both as a clustering problem and extrinsically
on a word sense disambiguation (WSD) task. In the cluster-based
evaluation, the Chinese Whispers algorithm outperformed
Markov Clustering, yielding a normalized mutual information
score of 64.3. In contrast, in WSD evaluation Markov Clustering
performed better, yielding an accuracy of about 75%. We are
making available two induced sense inventories of the 10,000 most
frequent Croatian words: one coarse-grained and one fine-grained
inventory, both obtained using the Markov Clustering algorithm.
A Multi-domain Corpus of Swedish Word Sense Annotation
Richard Johansson, Yvonne Adesam, Gerlof Bouma and Karin Hedberg
We describe the word sense annotation layer in Eukalyptus, a
freely available five-domain corpus of contemporary Swedish
with several annotation layers. The annotation uses the SALDO
lexicon to define the sense inventory, and allows word sense
annotation of compound segments and multiword units. We give
an overview of the new annotation tool developed for this project,
and finally present an analysis of the inter-annotator agreement
between two annotators.
QTLeap WSD/NED Corpora: Semantic Annotation of Parallel Corpora in Six Languages
Arantxa Otegi, Nora Aranberri, António Branco, Jan Hajic, Martin Popel, Kiril Simov, Eneko Agirre, Petya Osenova, Rita Pereira, João Silva and Steven Neale
This work presents parallel corpora automatically annotated with
several NLP tools, including lemma and part-of-speech tagging,
named-entity recognition and classification, named-entity
disambiguation, word-sense disambiguation, and coreference.
The corpora comprise both the well-known Europarl corpus and a
domain-specific question-answer troubleshooting corpus in the
IT domain. English is common in all parallel corpora, with
translations in five languages, namely, Basque, Bulgarian, Czech,
Portuguese and Spanish. We describe the annotated corpora and
the tools used for annotation, as well as annotation statistics for
each language. These new resources are freely available and will
help research on semantic processing for machine translation and
cross-lingual transfer.
Combining Semantic Annotation of Word Sense & Semantic Roles: A Novel Annotation Scheme for VerbNet Roles on German Language Data
Éva Mújdricza-Maydt, Silvana Hartmann, Iryna Gurevych and Anette Frank
We present a VerbNet-based annotation scheme for semantic roles
that we explore in an annotation study on German language
data that combines word sense and semantic role annotation.
We reannotate a substantial portion of the SALSA corpus with
GermaNet senses and a revised scheme of VerbNet roles. We
provide a detailed evaluation of the interaction between sense and
role annotation. The resulting corpus will allow us to compare
VerbNet role annotation for German to FrameNet and PropBank
annotation by mapping to existing role annotations on the SALSA
corpus. We publish the annotated corpus and detailed guidelines
for the new role annotation scheme.
Synset Ranking of Hindi WordNet
Sudha Bhingardive, Rajita Shukla, Jaya Saraswati, Laxmi Kashyap, Dhirendra Singh and Pushpak Bhattacharya
Word Sense Disambiguation (WSD) is one of the open problems
in the area of natural language processing. Various supervised,
unsupervised and knowledge based approaches have been
proposed for automatically determining the sense of a word in a
particular context. It has been observed that such approaches often
find it difficult to beat the WordNet First Sense (WFS) baseline
which assigns the sense irrespective of context. In this paper, we
present our work on creating the WFS baseline for Hindi language
by manually ranking the synsets of Hindi WordNet. A ranking
tool was developed in which human experts can see the frequency of
the word senses in the sense-tagged corpora; the experts were asked
to rank the senses of a word using this information together with
their own intuition. The accuracy of the WFS baseline is tested on
several standard datasets. The F-score is found to be 60%, 65% and
55% on the Health, Tourism and News datasets, respectively. The
created rankings can also be used in other NLP applications viz.,
Machine Translation, Information Retrieval, Text Summarization,
etc.
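Once such a ranking exists, applying the WFS baseline itself is trivial: every occurrence of a word is assigned its top-ranked sense, ignoring context. A minimal sketch with hypothetical sense identifiers:

```python
def wfs_baseline(word, sense_ranking):
    """WordNet First Sense baseline: return the top-ranked sense of a word,
    irrespective of the context it occurs in."""
    senses = sense_ranking.get(word, [])
    return senses[0] if senses else None

# hypothetical ranked synset IDs for two words (keys and IDs are illustrative)
ranking = {"saal": ["saal.n.01_year", "saal.n.02_tree"],
           "haar": ["haar.n.01_defeat", "haar.n.02_necklace"]}
```

The difficulty the paper addresses lies entirely in producing a good `sense_ranking`, not in applying it.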
Neural Embedding Language Models in Semantic Clustering of Web Search Results
Andrey Kutuzov and Elizaveta Kuzmenko
In this paper, a new approach towards semantic clustering of the
results of ambiguous search queries is presented. We propose
using distributed vector representations of words trained with
the help of prediction-based neural embedding models to detect
senses of search queries and to cluster search engine results page
according to these senses. The words from titles and snippets
together with semantic relationships between them form a graph,
which is further partitioned into components related to different
query senses. This approach to search engine results clustering is
evaluated against a new manually annotated evaluation data set of
Russian search queries. We show that in the task of semantically
clustering search results, prediction-based models slightly but
stably outperform traditional count-based ones, with the same
training corpora.
O29 - Panel on International Initiatives from Public Agencies
Thursday, May 26, 16:55
Chairperson: Khalid Choukri Oral Session
O30 - Multimodality, Multimedia and Evaluation
Thursday, May 26, 16:55
Chairperson: Nick Campbell Oral Session
Impact of Automatic Segmentation on the Quality, Productivity and Self-reported Post-editing Effort of Intralingual Subtitles
Aitor Alvarez, Marina Balenciaga, Arantza del Pozo, Haritz Arzelus, Anna Matamala and Carlos-D. Martínez-Hinarejos
This paper describes the evaluation methodology followed to
measure the impact of using a machine learning algorithm to
automatically segment intralingual subtitles. The segmentation
quality, productivity and self-reported post-editing effort achieved
with this approach are shown to improve on those obtained by the
character-counting technique currently most widely employed for
automatic subtitle segmentation. The corpus used to
train and test the proposed automated segmentation method is also
described and shared with the community, in order to foster further
research in this area.
1 Million Captioned Dutch Newspaper Images
Desmond Elliott and Martijn Kleppe
Images naturally appear alongside text in a wide variety of media,
such as books, magazines, newspapers, and in online articles. This
type of multi-modal data offers an interesting basis for vision and
language research but most existing datasets use crowdsourced
text, which removes the images from their original context. In this
paper, we introduce the KBK-1M dataset of 1.6 million images
in their original context, with co-occurring texts found in Dutch
newspapers from 1922 to 1994. The images are digitally scanned
photographs, cartoons, sketches, and weather forecasts; the text
is generated from OCR scanned blocks. The dataset is suitable
for experiments in automatic image captioning, image–article
matching, object recognition, and data-to-text generation for
weather forecasting. It can also be used by humanities scholars to
analyse photographic style changes, the representation of people
and societal issues, and new tools for exploring photograph reuse
via image-similarity-based search.
Cross-validating Image Description Datasets and Evaluation Metrics
Josiah Wang and Robert Gaizauskas
The task of automatically generating sentential descriptions of
image content has become increasingly popular in recent years,
resulting in the development of large-scale image description
datasets and the proposal of various metrics for evaluating image
description generation systems. However, not much work has
been done to analyse and understand both datasets and the
metrics. In this paper, we propose using a leave-one-out cross
validation (LOOCV) process as a means to analyse multiply
annotated, human-authored image description datasets and the
various evaluation metrics, i.e. evaluating one image description
against other human-authored descriptions of the same image.
Such an evaluation process affords various insights into the image
description datasets and evaluation metrics, such as the variations
of image descriptions within and across datasets and also what
the metrics capture. We compute and analyse (i) human upper-
bound performance; (ii) ranked correlation between metric pairs
across datasets; (iii) lower-bound performance by comparing a
set of descriptions describing one image to another sentence not
describing that image. Interesting observations are made about
the evaluation metrics and image description datasets, and we
conclude that such cross-validation methods are extremely useful
for assessing and gaining insights into image description datasets
and evaluation metrics for image descriptions.
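The LOOCV procedure over a set of human descriptions of one image can be sketched as follows; the simple unigram-overlap metric stands in for the real image description metrics evaluated in the paper:

```python
def unigram_overlap(hypothesis, references):
    """Fraction of hypothesis tokens found in the best-matching reference."""
    hyp = set(hypothesis.lower().split())
    if not hyp:
        return 0.0
    return max(len(hyp & set(r.lower().split())) / len(hyp) for r in references)

def loocv_scores(descriptions, metric):
    """Leave-one-out: score each human description against the others,
    giving an estimate of human upper-bound performance."""
    return [metric(d, descriptions[:i] + descriptions[i + 1:])
            for i, d in enumerate(descriptions)]

# hypothetical human-authored descriptions of one image
descs = ["a dog runs", "a dog is running", "the cat sleeps"]
scores = loocv_scores(descs, unigram_overlap)
```

Averaging such scores over all images yields the human upper-bound figure the paper computes for each metric.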
Detection of Major ASL Sign Types in Continuous Signing For ASL Recognition
Polina Yanovich, Carol Neidle and Dimitris Metaxas
In American Sign Language (ASL) as well as other signed
languages, different classes of signs (e.g., lexical signs,
fingerspelled signs, and classifier constructions) have different
internal structural properties. Continuous sign recognition
accuracy can be improved through use of distinct recognition
strategies, as well as different training datasets, for each class
of signs. For these strategies to be applied, continuous
signing video needs to be segmented into parts corresponding to
particular classes of signs. In this paper we present a multiple
instance learning-based segmentation system that accurately
labels 91.27% of the video frames of 500 continuous utterances
(including 7 different subjects) from the publicly accessible
NCSLGR corpus http://secrets.rutgers.edu/dai/queryPages
(Neidle and Vogler, 2012). The system uses novel
feature descriptors derived from both motion and shape statistics
of the regions of high local motion. The system does not require a
hand tracker.
O31 - Summarisation and Simplification
Thursday, May 26, 16:55
Chairperson: Udo Kruschwitz Oral Session
Benchmarking Lexical Simplification Systems
Gustavo Paetzold and Lucia Specia
Lexical Simplification is the task of replacing complex words
in a text with simpler alternatives. A variety of strategies
have been devised for this challenge, yet there has been little
effort in comparing their performance. In this contribution,
we present a benchmarking of several Lexical Simplification
systems. By combining resources created in previous work
with automatic spelling and inflection correction techniques,
we introduce BenchLS: a new evaluation dataset for the task.
Using BenchLS, we evaluate the performance of solutions for
various steps in the typical Lexical Simplification pipeline,
both individually and jointly. This is the first time Lexical
Simplification systems are compared in such fashion on the same
data, and the findings introduce many contributions to the field,
revealing several interesting properties of the systems evaluated.
A Multi-Layered Annotated Corpus of Scientific Papers
Beatriz Fisas, Francesco Ronzano and Horacio Saggion
Scientific literature records the research process with a
standardized structure and provides the clues to track the
progress in a scientific field. Understanding its internal structure
and content is of paramount importance for natural language
processing (NLP) technologies. To meet this requirement, we
have developed a multi-layered annotated corpus of scientific
papers in the domain of Computer Graphics. Sentences are
annotated with respect to their role in the argumentative structure
of the discourse. The purpose of each citation is specified.
Special features of the scientific discourse such as advantages and
disadvantages are identified. In addition, a grade is allocated to
each sentence according to its relevance for being included in
a summary. To the best of our knowledge, such a complex,
multi-layered collection of annotations and metadata characterizing a
set of research papers has never been grouped together before in
one corpus, and it therefore constitutes a new, richer resource with
respect to those currently available in the field.
Extractive Summarization under Strict Length Constraints
Yashar Mehdad, Amanda Stent, Kapil Thadani, Dragomir Radev, Youssef Billawala and Karolina Buchner
In this paper we report a comparison of various techniques
for single-document extractive summarization under strict length
budgets, which is a common commercial use case (e.g.
summarization of news articles by news aggregators). We show
that, evaluated using ROUGE, numerous algorithms from the
literature fail to beat a simple lead-based baseline for this task.
However, a supervised approach with lightweight and efficient
features improves over the lead-based baseline. Additional
human evaluation demonstrates that the supervised approach also
performs competitively with a commercial system that uses more
sophisticated features.
What’s the Issue Here?: Task-based Evaluation of Reader Comment Summarization Systems
Emma Barker, Monica Paramita, Adam Funk, Emina Kurtic, Ahmet Aker, Jonathan Foster, Mark Hepple and Robert Gaizauskas
Automatic summarization of reader comments in on-line news
is an extremely challenging task and a capability for which
there is a clear need. Work to date has focussed on producing
extractive summaries using well-known techniques imported from
other areas of language processing. But are extractive summaries
of comments what users really want? Do they support users
in performing the sorts of tasks they are likely to want to
perform with reader comments? In this paper we address these
questions by doing three things. First, we offer a specification
of one possible summary type for reader comment, based on an
analysis of reader comment in terms of issues and viewpoints.
Second, we define a task-based evaluation framework for reader
comment summarization that allows summarization systems to
be assessed in terms of how well they support users in a time-
limited task of identifying issues and characterising opinion on
issues in comments. Third, we describe a pilot evaluation in
which we used the task-based evaluation framework to evaluate a
prototype reader comment clustering and summarization system,
demonstrating the viability of the evaluation framework and
illustrating the sorts of insight such an evaluation affords.
O32 - Morphology (2)
Thursday, May 26, 16:55
Chairperson: Marko Tadic Oral Session
A Novel Evaluation Method for Morphological Segmentation
Javad Nouri and Roman Yangarber
Unsupervised learning of morphological segmentation of words
in a language, based only on a large corpus of words, is a
challenging task. Evaluation of the learned segmentations is
a challenge in itself, due to the inherent ambiguity of the
segmentation task. There is no way to posit unique “correct”
segmentation for a set of data in an objective way. Two models
may arrive at different ways of segmenting the data, which may
nonetheless both be valid. Several evaluation methods have been
proposed to date, but they do not insist on consistency of the
evaluated model. We introduce a new evaluation methodology,
which enforces correctness of segmentation boundaries while also
assuring consistency of segmentation decisions across the corpus.
Bilingual Lexicon Extraction at the Morpheme Level Using Distributional Analysis
Amir Hazem and Béatrice Daille
Bilingual lexicon extraction from comparable corpora is usually
based on distributional methods when dealing with single word
terms (SWT). These methods often treat SWT as single tokens
without considering their compositional property. However, many
SWT are compositional (composed of roots and affixes) and
this information, if taken into account can be very useful to
match translational pairs, especially for infrequent terms where
distributional methods often fail. For instance, the English
compound xenograft which is composed of the root xeno and
the lexeme graft can be translated into French compositionally
by aligning each of its elements (xeno with xéno and graft with
greffe) resulting in the translation: xénogreffe. In this paper,
we experiment several distributional modellings at the morpheme
level that we apply to perform compositional translation to a
subset of French and English compounds. We show promising
results using distributional analysis at the root and affix levels.
We also show that the adapted approach significantly improves
bilingual lexicon extraction from comparable corpora compared
to the approach at the word level.
Remote Elicitation of Inflectional Paradigms to Seed Morphological Analysis in Low-Resource Languages
John Sylak-Glassman, Christo Kirov and David Yarowsky
Structured, complete inflectional paradigm data exists for very few
of the world’s languages, but is crucial to training morphological
analysis tools. We present methods inspired by linguistic
fieldwork for gathering inflectional paradigm data in a machine-
readable, interoperable format from remotely-located speakers of
any language. Informants are tasked with completing language-
specific paradigm elicitation templates. Templates are constructed
by linguists using grammatical reference materials to ensure
completeness. Each cell in a template is associated with
contextual prompts designed to help informants with varying
levels of linguistic expertise (from professional translators to
untrained native speakers) provide the desired inflected form. To
facilitate downstream use in interoperable NLP/HLT applications,
each cell is also associated with a language-independent machine-
readable set of morphological tags from the UniMorph Schema.
This data is useful for seeding morphological analysis and
generation software, particularly when the data is representative
of the range of surface morphological variation in the language.
At present, we have obtained 792 lemmas and 25,056 inflected
forms from 15 languages.
Very-large Scale Parsing and Normalization of Wiktionary Morphological Paradigms
Christo Kirov, John Sylak-Glassman, Roger Que and David Yarowsky
Wiktionary is a large-scale resource for cross-lingual lexical
information with great potential utility for machine translation
(MT) and many other NLP tasks, especially automatic
morphological analysis and generation. However, it is designed
primarily for human viewing rather than machine readability,
and presents numerous challenges for generalized parsing
and extraction due to a lack of standardized formatting and
grammatical descriptor definitions. This paper describes a
large-scale effort to automatically extract and standardize the
data in Wiktionary and make it available for use by the NLP
research community. The methodological innovations include a
multidimensional table parsing algorithm, a cross-lexeme, token-
frequency-based method of separating inflectional form data
from grammatical descriptors, the normalization of grammatical
descriptors to a unified annotation scheme that accounts for cross-
linguistic diversity, and a verification and correction process that
exploits within-language, cross-lexeme table format consistency
to minimize human effort. The effort described here resulted
in the extraction of a uniquely large normalized resource of
nearly 1,000,000 inflectional paradigms across 350 languages.
Evaluation shows that even though the data is extracted using
a language-independent approach, it is comparable in quantity
and quality to data extracted using hand-tuned, language-specific
approaches.
P40 - Dialogue (1)
Thursday, May 26, 18:20
Chairperson: Jens Edlund Poster Session
AppDialogue: Multi-App Dialogues for Intelligent Assistants
Ming Sun, Yun-Nung Chen, Zhenhao Hua, Yulian Tamres-Rudnicky, Arnab Dash and Alexander Rudnicky
Users will interact with an individual app on smart devices
(e.g., phone, TV, car) to fulfill a specific goal (e.g. find
a photographer), but users may also pursue more complex
tasks that will span multiple domains and apps (e.g. plan a
wedding ceremony). Planning and executing such multi-app
tasks are typically managed by users, considering the required
global context awareness. To investigate how users arrange
domains/apps to fulfill complex tasks in their daily life, we
conducted a user study on 14 participants to collect such data
from their Android smart phones. This document 1) summarizes
the techniques used in the data collection and 2) provides a brief
statistical description of the data. This data guides future
directions for researchers in the fields of conversational agents
and personal assistants. The data is available at
http://AppDialogue.com.
Modelling Multi-issue Bargaining Dialogues: Data Collection, Annotation Design and Corpus
Volha Petukhova, Christopher Stevens, Harmen de Weerd, Niels Taatgen, Fokie Cnossen and Andrei Malchanau
The paper describes experimental dialogue data collection
activities, as well as the creation of a semantically annotated
corpus, undertaken within the EU-funded METALOGUE project
(www.metalogue.eu). The project aims to develop a dialogue system
with flexible dialogue management to enable the system’s adaptive,
reactive, interactive and proactive dialogue behavior in setting
goals, choosing appropriate strategies and monitoring numerous
parallel interpretation and management processes. To achieve
these goals, a negotiation (or, more precisely, multi-issue
bargaining) scenario has been chosen as the specific setting and
application domain. The dialogue corpus forms the basis for the
design of task and interaction models of participants’ negotiation
behavior, and subsequently for the development of a dialogue
system capable of replacing one of
the negotiators. The METALOGUE corpus will be released to the
community for research purposes.
The Negochat Corpus of Human-agent Negotiation Dialogues
Vasily Konovalov, Ron Artstein, Oren Melamud and Ido Dagan
Annotated in-domain corpora are crucial to the successful
development of dialogue systems of automated agents, and in
particular for developing natural language understanding (NLU)
components of such systems. Unfortunately, such important
resources are scarce. In this work, we introduce an annotated
natural language human-agent dialogue corpus in the negotiation
domain. The corpus was collected using Amazon Mechanical
Turk following the ‘Wizard-Of-Oz’ approach, where a ‘wizard’
human translates the participants’ natural language utterances in
real time into a semantic language. Once dialogue collection
was completed, utterances were annotated with intent labels
by two independent annotators, achieving high inter-annotator
agreement. Our initial experiments with an SVM classifier show
that automatically inferring such labels from the utterances is far
from trivial. We make our corpus publicly available to serve as an
aid in the development of dialogue systems for negotiation agents,
and suggest that analogous corpora can be created following our
methodology and using our available source code. To the best
of our knowledge this is the first publicly available negotiation
dialogue corpus.
The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics
Ryuichiro Higashinaka, Kotaro Funakoshi, Yuka Kobayashi and Michimasa Inaba
Dialogue breakdown detection is a promising technique in
dialogue systems. To promote the research and development of
such a technique, we organized a dialogue breakdown detection
challenge where the task is to detect a system’s inappropriate
utterances that lead to dialogue breakdowns in chat. This paper
describes the design, datasets, and evaluation metrics for the
challenge as well as the methods and results of the submitted runs
of the participants.
The DialogBank
Harry Bunt, Volha Petukhova, Andrei Malchanau, Kars Wijnhoven and Alex Fang
This paper presents the DialogBank, a new language resource
consisting of dialogues with gold standard annotations according
to the ISO 24617-2 standard. Some of these dialogues have
been taken from existing corpora and have been re-annotated
according to the ISO standard; others have been annotated directly
according to the standard. The ISO 24617-2 annotations have
been designed according to the ISO principles for semantic
annotation, as formulated in ISO 24617-6. The DialogBank makes
use of three alternative representation formats, which are shown to
be interoperable.
Coordinating Communication in the Wild: The Artwalk Dialogue Corpus of Pedestrian Navigation and Mobile Referential Communication
Kris Liu, Jean Fox Tree and Marilyn Walker
The Artwalk Corpus is a collection of 48 mobile phone
conversations between 24 pairs of friends and 24 pairs of
strangers performing a novel, naturalistically-situated referential
communication task. This task produced dialogues which, on
average, are just under 40 minutes. The task requires the
identification of public art while walking around and navigating
pedestrian routes in the downtown area of Santa Cruz, California.
The task involves a Director on the UCSC campus with access
to maps providing verbal instructions to a Follower executing
the task. The task provides a setting for real-world situated
dialogic language and is designed to: (1) elicit entrainment
and coordination of referring expressions between the dialogue
participants, (2) examine the effect of friendship on dialogue
strategies, and (3) examine how the need to complete the task
while negotiating myriad, unanticipated events in the real world
– such as avoiding cars and other pedestrians – affects linguistic
coordination and other dialogue behaviors. Previous work
on entrainment and coordinating communication has primarily
focused on similar tasks in laboratory settings where there are no
interruptions and no need to navigate from one point to another
in a complex space. The corpus provides a general resource for
studies on how coordinated task-oriented dialogue changes when
we move outside the laboratory and into the world. It can also
be used for studies of entrainment in dialogue, and the form and
style of pedestrian instruction dialogues, as well as the effect of
friendship on dialogic behaviors.
Managing Linguistic and Terminological Variation in a Medical Dialogue System
Leonardo Campillos Llanos, Dhouha Bouamor, Pierre Zweigenbaum and Sophie Rosset
We introduce a dialogue task between a virtual patient and a doctor
where the dialogue system, playing the patient part in a simulated
consultation, must reconcile a specialized level, to understand
what the doctor says, and a lay level, to output realistic patient-
language utterances. This increases the challenges in the analysis
and generation phases of the dialogue. This paper proposes
methods to manage linguistic and terminological variation in
that situation and illustrates how they help produce realistic
dialogues. Our system makes use of lexical resources for
processing synonyms, inflectional and derivational variants, or
pronoun/verb agreement. In addition, specialized knowledge is
used for processing medical roots and affixes, ontological relations
and concept mapping, and for generating lay variants of terms
according to the patient’s non-expert discourse. We also report the
results of a first evaluation carried out by 11 users interacting with
the system. We evaluated the non-contextual analysis module,
which supports the Spoken Language Understanding step. The
annotation of task domain entities obtained 91.8% precision,
82.5% recall, 86.9% F-measure, a 19.0% slot error rate,
and a 32.9% sentence error rate.
A Corpus of Word-Aligned Asked and Anticipated Questions in a Virtual Patient Dialogue System
Ajda Gokcen, Evan Jaffe, Johnsey Erdmann, Michael White and Douglas Danforth
We present a corpus of virtual patient dialogues to which we
have added manually annotated gold standard word alignments.
Since each question asked by a medical student in the dialogues
is mapped to a canonical, anticipated version of the question,
the corpus implicitly defines a large set of paraphrase (and non-
paraphrase) pairs. We also present a novel process for selecting
the most useful data to annotate with word alignments and for
ensuring consistent paraphrase status decisions. In support of
this process, we have enhanced the earlier Edinburgh alignment
tool (Cohn et al., 2008) and revised and extended the Edinburgh
guidelines, in particular adding guidance intended to ensure that
the word alignments are consistent with the overall paraphrase
status decision. The finished corpus and the enhanced alignment
tool are made freely available.
A CUP of CoFee: A large Collection of feedback Utterances Provided with communicative function annotations
Laurent Prévot, Jan Gorisch and Roxane Bertrand
Several previous attempts have been made to annotate verbal
feedback utterances in English with communicative functions.
Here, we suggest an annotation scheme for verbal and non-verbal
feedback utterances in French including the categories base,
attitude, previous and visual. The data comprises conversations,
maptasks and negotiations from which we extracted ca. 13,000
candidate feedback utterances and gestures. 12 students were
recruited for the annotation campaign of ca. 9,500 instances. Each
instance was annotated by between 2 and 7 raters. The evaluation
of the annotation agreement resulted in an average best-pair kappa
of 0.6. While the base category with the values acknowledgement,
evaluation, answer and elicit achieves good agreement, this is not the
case for the other main categories. The data sets, which also
include automatic extractions of lexical, positional and acoustic
features, are freely available and will further be used for machine
learning classification experiments to analyse the form-function
relationship of feedback.
P41 - Language Learning
Thursday, May 26, 18:20
Chairperson: Costanza Navarretta Poster Session
Palabras: Crowdsourcing Transcriptions of L2 Speech
Eric Sanders, Pepi Burgos, Catia Cucchiarini and Roeland van Hout
We developed a web application for crowdsourcing transcriptions
of Dutch words spoken by Spanish L2 learners. In this paper we
discuss the design of the application and the influence of metadata
and various forms of feedback. Useful data were obtained from
159 participants, with an average of over 20 transcriptions per
item, which seems a satisfactory result for this type of research.
Informing participants about how many items they still had to
complete, rather than how many they had already completed, turned
out to be an incentive to do more items. Assigning participants a
score for their performance made it more attractive for them to
carry out the transcription task, but this seemed to influence their
performance. We discuss possible advantages and disadvantages
in connection with the aim of the research and consider possible
lessons for designing future experiments.
The Uppsala Corpus of Student Writings: Corpus Creation, Annotation, and Analysis
Beata Megyesi, Jesper Näsman and Anne Palmér
The Uppsala Corpus of Student Writings consists of Swedish
texts produced as part of a national test of students ranging in
age from nine (in year three of primary school) to nineteen (the
last year of upper secondary school) who are studying either
Swedish or Swedish as a second language. National tests have
been collected since 1996. The corpus currently consists of 2,500
texts containing over 1.5 million tokens. Parts of the texts have
been annotated on several linguistic levels using existing state-of-
the-art natural language processing tools. In order to make the
corpus easy to interpret for scholars in the humanities, we chose
the CoNLL format instead of an XML-based representation. Since
spelling and grammatical errors are common in student writings,
the texts are automatically corrected while keeping the original
tokens in the corpus. Each token is annotated with part-of-speech
and morphological features as well as syntactic structure. The
main purpose of the corpus is to facilitate the systematic and
quantitative empirical study of the writings of various student
groups based on gender, geographic area, age, grade awarded or
a combination of these, synchronically or diachronically. The
intention is for this to be a monitor corpus, currently under
development.
Corpus for Children’s Writing with Enhanced Output for Specific Spelling Patterns (2nd and 3rd Grade)
Kay Berkling
This paper describes the collection of the H1 Corpus of children’s
weekly writing over the course of 3 months in 2nd and 3rd grades,
aged 7-11. The texts were collected within the normal classroom
setting by the teacher. Texts of children whose parents signed
the permission to donate the texts to science were collected and
transcribed. The corpus consists of the elicitation techniques, an
overview of the data collected and the transcriptions of the texts
both with and without spelling errors, aligned on a word by word
basis, as well as the scanned in texts. The corpus is available
for research via Linguistic Data Consortium (LDC). Researchers
are strongly encouraged to make additional annotations and
improvements and return it to the public domain via LDC.
The COPLE2 corpus: a learner corpus for Portuguese
Amália Mendes, Sandra Antunes, Maarten Janssen and Anabela Gonçalves
We present the COPLE2 corpus, a learner corpus of Portuguese
that includes written and spoken texts produced by learners
of Portuguese as a second or foreign language. The corpus
includes at the moment a total of 182,474 tokens and 978 texts,
classified according to the CEFR scales. The original handwritten
productions are transcribed in TEI compliant XML format and
keep record of all the original information, such as reformulations,
insertions and corrections made by the teacher, while the
recordings are transcribed and aligned with EXMARaLDA.
The TEITOK environment enables different views of the same
document (XML, student version, corrected version), a CQP-
based search interface, the POS, lemmatization and normalization
of the tokens, and will soon be used for error annotation in stand-
off format. The corpus has already been a source of data for
phonological, lexical and syntactic interlanguage studies and will
be used for a data-informed selection of language features for each
proficiency level.
French Learners Audio Corpus of German Speech (FLACGS)
Jane Wottawa and Martine Adda-Decker
The French Learners Audio Corpus of German Speech (FLACGS)
was created to compare German speech production of German
native speakers (GG) and French learners of German (FG)
across three speech production tasks of increasing production
complexity: repetition, reading and picture description. 40
speakers, 20 GG and 20 FG performed each of the three tasks,
which in total leads to approximately 7h of speech. The corpus
was manually transcribed and automatically aligned. Analyses
that can be performed on this type of corpus include, for instance,
segmental differences in the speech production of L2 learners
compared to native speakers. We chose the realization of the
velar nasal consonant engma. In spoken French, engma does not
appear in a VCV context which leads to production difficulties in
FG. With increasing speech production complexity (reading and
picture description), engma is realized as engma + plosive by FG
in over 50% of the cases. The results of a two way ANOVA with
unequal sample sizes on the durations of the different realizations
of engma indicate that duration is a reliable factor to distinguish
between engma and engma + plosive in FG productions compared
to the engma productions in GG in a VCV context. The FLACGS
corpus allows the study of L2 production and perception.
Croatian Error-Annotated Corpus of Non-Professional Written Language
Vanja Štefanec, Nikola Ljubešić and Jelena Kuvač Kraljević
In this paper, the authors present the Croatian corpus of
non-professional written language. It consists of two subcorpora:
a clinical subcorpus of written texts produced by speakers with
various types of language disorders, and a healthy speakers
subcorpus. Together with its several levels of annotation, the
corpus offers opportunities for different lines of research.
The authors present the corpus structure, describe the sampling
methodology, explain the levels of annotation, and give some very
basic statistics. On the basis of data from the corpus, existing
language technologies for Croatian are adapted in order to be
implemented in a platform facilitating text production to speakers
with language disorders. In this respect, several analyses of the
corpus data and a basic evaluation of the developed technologies
are presented.
P42 - Less-Resourced Languages
Thursday, May 26, 18:20
Chairperson: Laurette Pretorius Poster Session
Training & Quality Assessment of an Optical Character Recognition Model for Northern Haida
Isabell Hubert, Antti Arppe, Jordan Lachler and Eddie Antonio Santos
We are presenting our work on the creation of the first optical
character recognition (OCR) model for Northern Haida, also
known as Masset or Xaad Kil, a nearly extinct First Nations
language spoken in the Haida Gwaii archipelago in British
Columbia, Canada. We are addressing the challenges of training
an OCR model for a language with an extensive, non-standard
Latin character set as follows: (1) We have compared various
training approaches and present the results of practical analyses
to maximize recognition accuracy and minimize manual labor.
An approach using just one or two pages of Source Images
directly performed better than the Image Generation approach,
and better than models based on three or more pages. Analyses
also suggest that a character’s frequency is directly correlated
with its recognition accuracy. (2) We present an overview of
current OCR accuracy analysis tools available. (3) We have ported
the once de-facto standardized OCR accuracy tools to be able
to cope with Unicode input. Our work adds to a growing body
of research on OCR for particularly challenging character sets,
and contributes to creating the largest electronic corpus for this
severely endangered language.
Legacy language atlas data mining: mapping Kru languages
Dafydd Gibbon
An online tool based on dialectometric methods, DistGraph, is
applied to a group of Kru languages of Côte d’Ivoire, Liberia
and Burkina Faso. The inputs to this resource consist of tables
of languages x linguistic features (e.g. phonological, lexical or
grammatical), and statistical and graphical outputs are generated
which show similarities and differences between the languages
in terms of the features as virtual distances. In the present
contribution, attention is focussed on the consonant systems of the
languages, a traditional starting point for language comparison.
The data are harvested from a legacy language data resource based
on fieldwork in the 1970s and 1980s, a language atlas of the
Kru languages. The method on which the online tool is based
extends beyond documentation of individual languages to the
documentation of language groups, and supports difference-based
prioritisation in education programmes, decisions on language
policy and documentation and conservation funding, as well as
research on language typology and heritage documentation of
history and migration.
Data Formats and Management Strategies from the Perspective of Language Resource Producers – Personal Diachronic and Social Synchronic Data Sharing –
Kazushi Ohya
This is a report of findings from on-going language documentation
research based on three consecutive projects from 2008 to 2016.
In the light of this research, we propose that (1) we should stand
on the side of language resource producers to enhance the research
of language processing. We support personal data management
in addition to social data sharing. (2) This support leads to
adopting simple data formats instead of the multi-link-path data
models proposed as international standards up to the present.
(3) We should set up a framework for total language resource
study that includes not only pivotal data formats such as standard
formats, but also the surroundings of data formation to capture
a wider range of language activities, e.g. annotation, hesitant
language formation, and reference-referent relations. A study of
this framework is expected to be a foundation of rebuilding man-
machine interface studies in which we seek to observe generative
processes of informational symbols in order to establish a high
affinity interface in regard to documentation.
Curation of Dutch Regional Dictionaries
Henk van den Heuvel, Eric Sanders and Nicoline van der Sijs
This paper describes the process of semi-automatically converting
dictionaries from paper to structured text (database) and the
integration of these into the CLARIN infrastructure in order
to make the dictionaries accessible and retrievable for the
research community. The case study at hand is that of the
curation of 42 fascicles of the Dictionaries of the Brabantic and
Limburgian dialects, and 6 fascicles of the Dictionary of dialects
in Gelderland.
Fostering digital representation of EU regional and minority languages: the Digital Language Diversity Project
Claudia Soria, Irene Russo, Valeria Quochi, Davyth Hicks, Antton Gurrutxaga, Anneli Sarhimaa and Matti Tuomisto
Poor digital representation of minority languages further hinders
their usability on digital media and devices. The Digital Language
Diversity Project, a three-year project funded under the Erasmus+
programme, aims at addressing the problem of low digital
representation of EU regional and minority languages by giving
their speakers the intellectual and practical skills to create, share,
and reuse online digital content. Availability of digital content
and technical support to use it are essential prerequisites for the
development of language-based digital applications, which in turn
can boost digital usage of these languages. In this paper we
introduce the project, its aims, objectives and current activities for
sustaining digital usability of minority languages through adult
education.
Cysill Ar-lein: A Corpus of Written Contemporary Welsh Compiled from an On-line Spelling and Grammar Checker
Delyth Prys, Gruffudd Prys and Dewi Bryn Jones
This paper describes the use of a free, on-line language spelling
and grammar checking aid as a vehicle for the collection of
a significant (31 million words and rising) corpus of text for
academic research in the context of less resourced languages
where such data in sufficient quantities are often unavailable. It
describes two versions of the corpus: the texts as submitted,
prior to the correction process, and the texts following the user’s
incorporation of any suggested changes. An overview of the
corpus’ contents is given and an analysis of use including usage
statistics is also provided. Issues surrounding privacy and the
anonymization of data are explored as is the data’s potential use
for linguistic analysis, lexical research and language modelling.
The method used for gathering this corpus is believed to be
unique, and is a valuable addition to corpus studies in a minority
language.
ALT Explored: Integrating an Online Dialectometric Tool and an Online Dialect Atlas
Martijn Wieling, Eva Sassolini, Sebastiana Cucurullo and Simonetta Montemagni
In this paper, we illustrate the integration of an online
dialectometric tool, Gabmap, together with an online dialect
atlas, the Atlante Lessicale Toscano (ALT-Web). By using
a newly created url-based interface to Gabmap, ALT-Web is
able to take advantage of the sophisticated dialect visualization
and exploration options incorporated in Gabmap. For example,
distribution maps showing the distribution in the Tuscan dialect
area of a specific dialectal form (selected via the ALT-Web
website) are easily obtainable. Furthermore, the complete ALT-
Web dataset as well as subsets of the data (selected via the ALT-
Web website) can be automatically uploaded and explored in
Gabmap. By combining these two online applications, macro- and
micro-analyses of dialectal data (respectively offered by Gabmap
and ALT-Web) are effectively and dynamically combined.
LORELEI Language Packs: Data, Tools, and Resources for Technology Development in Low Resource Languages
Stephanie Strassel and Jennifer Tracey
In this paper, we describe the textual linguistic resources in nearly
3 dozen languages being produced by Linguistic Data Consortium
for DARPA’s LORELEI (Low Resource Languages for Emergent
Incidents) Program. The goal of LORELEI is to improve the
performance of human language technologies for low-resource
languages and enable rapid re-training of such technologies for
new languages, with a focus on the use case of deployment
of resources in sudden emergencies such as natural disasters.
Representative languages have been selected to provide broad
typological coverage for training, and surprise incident languages
for testing will be selected over the course of the program. Our
approach treats the full set of language packs as a coherent
whole, maintaining LORELEI-wide specifications, tagsets, and
guidelines, while allowing for adaptation to the specific needs
created by each language. Each representative language corpus,
therefore, both stands on its own as a resource for the specific
language and forms part of a large multilingual resource for
broader cross-language technology development.
A Computational Perspective on the Romanian Dialects
Alina Maria Ciobanu and Liviu P. Dinu
In this paper we conduct an initial study on the dialects of
Romanian. We analyze the differences between Romanian and its
dialects using the Swadesh list. We then assess the predictive power
of the orthographic and phonetic features of the words, building a
classification problem for dialect identification.
The Alaskan Athabascan Grammar Database
Sebastian Nordhoff, Siri Tuttle and Olga Lovick
This paper describes a repository of example sentences in three
endangered Athabascan languages: Koyukon, Upper Tanana, and Lower Tanana. The repository allows researchers or language
teachers to browse the example sentence corpus to either
investigate the languages or to prepare teaching materials. The
originally heterogeneous text collection was imported into a
SOLR store via the POIO bridge. This paper describes the
requirements, implementation, advantages and drawbacks of this
approach and discusses the potential to apply it for other languages
of the Athabascan family or beyond.
Constraint-Based Bilingual Lexicon Induction for Closely Related Languages
Arbi Haza Nasution, Yohei Murakami and Toru Ishida
The lack or absence of parallel and comparable corpora makes bilingual lexicon extraction a difficult task for low-resource languages. Pivot-language and cognate-recognition approaches have proven useful for inducing bilingual lexicons for such languages. We analyze the features of closely related
languages and define a semantic constraint assumption. Based
on the assumption, we propose a constraint-based bilingual
lexicon induction for closely related languages by extending
constraints and translation pair candidates from a recent pivot-language approach. We further define three constraint sets
based on language characteristics. In this paper, two controlled
experiments are conducted. The former involves four closely
related language pairs with different language pair similarities,
and the latter focuses on sense connectivity between non-pivot
words and pivot words. We evaluate our results with the F-measure. The results indicate that our method works better on voluminous input dictionaries and high-similarity languages. Finally, we
introduce a strategy to use proper constraint sets for different goals
and language characteristics.
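The F-measure evaluation mentioned in the abstract above can be sketched as follows. This is an illustrative example rather than the authors' code, and the translation pairs in it are invented for the sake of the example:

```python
# Illustrative sketch (not the authors' implementation): scoring an
# induced bilingual lexicon against a gold-standard lexicon with
# precision, recall, and the balanced F-measure.

def f_measure(induced, gold):
    """induced, gold: sets of (source_word, target_word) translation pairs."""
    true_positives = len(induced & gold)
    precision = true_positives / len(induced) if induced else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example with invented pairs: 2 of 3 induced pairs are correct,
# and 2 of 3 gold pairs are recovered.
induced = {("ambo", "saya"), ("makan", "makan"), ("rumah", "jalan")}
gold = {("ambo", "saya"), ("makan", "makan"), ("rumah", "rumah")}
score = f_measure(induced, gold)
```

With equal precision and recall, as here, the F-measure simply equals both.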
P43 - Named Entity Recognition
Thursday, May 26, 18:20
Chairperson: Sara Tonelli Poster Session
WTF-LOD - A New Resource for Large-Scale NER Evaluation
Lubomir Otrusina and Pavel Smrz
This paper introduces the Web TextFull linkage to Linked Open
Data (WTF-LOD) dataset intended for large-scale evaluation of
named entity recognition (NER) systems. First, we present the
process of collecting data from the largest publicly available
textual corpora, including Wikipedia dumps, monthly runs of
the CommonCrawl, and ClueWeb09/12. We discuss similarities
and differences of related initiatives such as WikiLinks and
WikiReverse. Our work primarily focuses on links from “textfull”
documents (links surrounded by a text that provides a useful
context for entity linking), de-duplication of the data and advanced
cleaning procedures. The presented statistics demonstrate that the collected data forms one of the largest available resources of its kind. They also demonstrate its suitability for complex
NER evaluation campaigns, including an analysis of the most
ambiguous name mentions appearing in the data.
Using a Language Technology Infrastructure for German in order to Anonymize German Sign Language Corpus Data
Julian Bleicken, Thomas Hanke, Uta Salden and Sven Wagner
For publishing sign language corpus data on the web,
anonymization is crucial even if it is impossible to hide the
visual appearance of the signers: In a small community, even
vague references to third persons may be enough to identify those
persons. In the case of the DGS Korpus (German Sign Language
corpus) project, we want to publish data as a contribution to the
cultural heritage of the sign language community while annotation
of the data is still ongoing. This poses the question of how
well anonymization can be achieved given that no full linguistic
analysis of the data is available. Basically, we combine analysis
of all data that we have, including named entity recognition
on translations into German. For this, we use the WebLicht
language technology infrastructure. We report on the reliability
of these methods in this special context and also illustrate how the
anonymization of the video data is technically achieved in order
to minimally disturb the viewer.
Crowdsourced Corpus with Entity Salience Annotations
Milan Dojchinovski, Dinesh Reddy, Tomas Kliegr, Tomas Vitvar and Harald Sack
In this paper, we present a crowdsourced dataset which adds entity
salience (importance) annotations to the Reuters-128 dataset,
which is a subset of Reuters-21578. The dataset is distributed under a free license and published in the NLP Interchange Format, which
fosters interoperability and re-use. We show the potential of the
dataset on the task of learning an entity salience classifier and
report on the results from several experiments.
ELMD: An Automatically Generated Entity Linking Gold Standard Dataset in the Music Domain
Sergio Oramas, Luis Espinosa Anke, Mohamed Sordo, Horacio Saggion and Xavier Serra
In this paper we present a gold standard dataset for Entity Linking
(EL) in the Music Domain. It contains thousands of musical
named entities such as Artist, Song or Record Label, which
have been automatically annotated on a set of artist biographies
coming from the Music website and social network Last.fm. The
annotation process relies on the analysis of the hyperlinks present in the source texts and on a voting-based algorithm for EL, which
considers, for each entity mention in text, the degree of agreement
across three state-of-the-art EL systems. Manual evaluation shows
that EL Precision is at least 94%, and due to its tunable nature,
it is possible to derive annotations favouring higher Precision or
Recall, at will. We make available the annotated dataset along
with evaluation data and the code.
Bridge-Language Capitalization Inference in Western Iranian: Sorani, Kurmanji, Zazaki, and Tajik
Patrick Littell, David R. Mortensen, Kartik Goyal, Chris Dyer and Lori Levin
In Sorani Kurdish, one of the most useful orthographic features
in named-entity recognition – capitalization – is absent, as
the language’s Perso-Arabic script does not make a distinction
between uppercase and lowercase letters. We describe a system
for deriving an inferred capitalization value from closely related
languages by phonological similarity, and illustrate the system
using several related Western Iranian languages.
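As a rough illustration of the idea in the abstract above, capitalization can be copied from the phonologically most similar word in a related language whose script does mark case. The word lists and the threshold-free nearest-neighbour rule here are invented for the example and are not the authors' system:

```python
# Hedged sketch: infer a capitalization value for a word from a
# caseless script by finding its closest counterpart (by edit distance
# over romanized forms) in a related cased-script language, then
# copying that counterpart's capitalization.

def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def infer_capitalization(word, cased_lexicon):
    """Return True if the nearest cased-lexicon word is capitalized."""
    nearest = min(cased_lexicon,
                  key=lambda w: edit_distance(word.lower(), w.lower()))
    return nearest[:1].isupper()

# Invented example: "kurdistan" matches the capitalized "Kurdistan".
lexicon = ["Berlin", "kitav", "Kurdistan"]
```

A real system would of course use phonological rather than plain orthographic similarity and a much larger bridge-language lexicon.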
Annotating Named Entities in Consumer Health Questions
Halil Kilicoglu, Asma Ben Abacha, Yassine Mrabet, Kirk Roberts, Laritza Rodriguez, Sonya Shooshan and Dina Demner-Fushman
We describe a corpus of consumer health questions annotated
with named entities. The corpus consists of 1548 de-identified
questions about diseases and drugs, written in English. We
defined 15 broad categories of biomedical named entities for
annotation. A pilot annotation phase in which a small portion of
the corpus was double-annotated by four annotators was followed
by a main phase in which double annotation was carried out by
six annotators, and a reconciliation phase in which all annotations
were reconciled by an expert. We conducted the annotation in
two modes, manual and assisted, to assess the effect of automatic
pre-annotation and calculated inter-annotator agreement. We
obtained moderate inter-annotator agreement; assisted annotation
yielded slightly better agreement and fewer missed annotations
than manual annotation. Due to the complex nature of biomedical
entities, we paid particular attention to nested entities for which
we obtained slightly lower inter-annotator agreement, confirming
that annotating nested entities is somewhat more challenging. To
our knowledge, the corpus is the first of its kind for consumer
health text and is publicly available.
A Regional News Corpora for Contextualized Entity Discovery and Linking
Adrian Brasoveanu, Lyndon J.B. Nixon, Albert Weichselbraun and Arno Scharl
This paper presents a German corpus for Named Entity Linking
(NEL) and Knowledge Base Population (KBP) tasks. We describe
the annotation guideline, the annotation process, NIL clustering
techniques and conversion to popular NEL formats such as NIF
and TAC that have been used to construct this corpus based
on news transcripts from the German regional broadcaster RBB
(Rundfunk Berlin Brandenburg). Since creating such language
resources requires significant effort, the paper also discusses how
to derive additional evaluation resources for tasks like named
entity contextualization or ontology enrichment by exploiting the
links between named entities from the annotated corpus. The
paper concludes with an evaluation that shows how several well-
known NEL tools perform on the corpus, a discussion of the
evaluation results, and with suggestions on how to keep evaluation
corpora and datasets up to date.
DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus
Martin Brümmer, Milan Dojchinovski and Sebastian Hellmann
The ever-increasing importance of machine learning in Natural Language Processing is accompanied by an equally increasing need for large-scale training and evaluation corpora. Due to its size, openness and relative quality, Wikipedia has already
been a source of such data, but on a limited scale. This paper
introduces the DBpedia Abstract Corpus, a large-scale, open
corpus of annotated Wikipedia texts in six languages, featuring
over 11 million texts and over 97 million entity links. The
properties of the Wikipedia texts are described, as well as
the corpus creation process, its format and interesting use-cases,
like Named Entity Linking training and evaluation.
Government Domain Named Entity Recognition for South African Languages
Roald Eiselen
This paper describes the named entity language resources
developed as part of a development project for the South African
languages. The development efforts focused on creating protocols
and annotated data sets with at least 15,000 annotated named
entity tokens for ten of the official South African languages. The
description of the protocols and annotated data sets provide an
overview of the problems encountered during the annotation of
the data sets. Based on these annotated data sets, CRF named
entity recognition systems are developed that leverage existing
linguistic resources. The newly created named entity recognisers
are evaluated, with F-scores of between 0.64 and 0.77, and error
analysis is performed to identify possible avenues for improving
the quality of the systems.
Named Entity Resources - Overview and Outlook
Maud Ehrmann, Damien Nouvel and Sophie Rosset
Recognition of real-world entities is crucial for most NLP
applications. Since its introduction some twenty years ago, named
entity processing has undergone a significant evolution with,
among others, the definition of new tasks (e.g. entity linking) and
the emergence of new types of data (e.g. speech transcriptions,
micro-blogging). These certainly pose new challenges which
affect not only methods and algorithms but especially linguistic
resources. Where do we stand with respect to named entity
resources? This paper aims to provide a systematic overview of named entity resources, accounting for qualities such as multilingualism, dynamicity and interoperability, and to identify shortfalls in order to guide future developments.
Incorporating Lexico-semantic Heuristics into Coreference Resolution Sieves for Named Entity Recognition at Document-level
Marcos Garcia
This paper explores the incorporation of lexico-semantic
heuristics into a deterministic Coreference Resolution (CR)
system for classifying named entities at document-level. The
highest-precision sieves of a CR tool are enriched both with a set of heuristics for merging named entities labeled with different classes and with constraints that avoid the incorrect merging of similar mentions. Several tests show that this strategy
improves both NER labeling and CR. The CR tool can be applied
in combination with any system for named entity recognition
using the CoNLL format, and brings benefits to text analytics tasks
such as Information Extraction. Experiments were carried out in
Spanish, using three different NER tools.
Using Word Embeddings to Translate Named Entities
Octavia-Maria Sulea, Sergiu Nisioi and Liviu P. Dinu
In this paper we investigate the usefulness of neural word
embeddings in the process of translating Named Entities (NEs)
from a resource-rich language to a language low on resources
relevant to the task at hand, introducing a novel, yet simple way
of obtaining bilingual word vectors. Inspired by observations
in (Mikolov et al., 2013b), which show that training their word
vector model on comparable corpora yields comparable vector
space representations of those corpora, reducing the problem
of translating words to finding a rotation matrix, and results in
(Zou et al., 2013), which showed that bilingual word embeddings
can improve Chinese Named Entity Recognition (NER) and
English to Chinese phrase translation, we use the sentence-aligned
English-French EuroParl corpora and show that word embeddings
extracted from a merged corpus (corpus resulted from the merger
of the two aligned corpora) can be used for NE translation. We
extrapolate that word embeddings trained on merged parallel
corpora are useful in Named Entity Recognition and Translation
tasks for resource-poor languages.
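The linear-mapping view of translation attributed above to Mikolov et al. (2013b) can be sketched in a few lines. This is a generic illustration with toy vectors and invented French target words, not the authors' setup:

```python
import numpy as np

# Hedged sketch of translation via a learned linear map between
# embedding spaces: fit W so that source vectors map onto target
# vectors over a seed dictionary (least squares), then translate a new
# word by mapping its vector and taking the nearest target-side vector.

def learn_mapping(src_vecs, tgt_vecs):
    """Least-squares W with src_vecs @ W ~= tgt_vecs (rows are seed pairs)."""
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W

def translate(vec, W, target_vocab, target_vecs):
    """Return the target word whose vector is most cosine-similar to vec @ W."""
    mapped = vec @ W
    sims = target_vecs @ mapped / (
        np.linalg.norm(target_vecs, axis=1) * np.linalg.norm(mapped) + 1e-12)
    return target_vocab[int(np.argmax(sims))]
```

With embeddings trained on a merged parallel corpus, as in the paper, source and target words already share one vector space, and the mapping step can even reduce to the identity.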
O33 - Textual Entailment
Thursday, May 26, 18:20
Chairperson: Lucia Specia Oral Session
TEG-REP: A corpus of Textual Entailment Graphs based on Relation Extraction Patterns
Kathrin Eichler, Feiyu Xu, Hans Uszkoreit, Leonhard Hennig and Sebastian Krause
The task of relation extraction is to recognize and extract
relations between entities or concepts in texts. Dependency parse
trees have become a popular source for discovering extraction
patterns, which encode the grammatical relations among the
phrases that jointly express relation instances. State-of-the-art
weakly supervised approaches to relation extraction typically
extract thousands of unique patterns that only potentially express the target relation. Among these patterns, some are semantically
equivalent, but differ in their morphological, lexical-semantic or
syntactic form. Some express a relation that entails the target
relation. We propose a new approach to structuring extraction
patterns by utilizing entailment graphs, hierarchical structures
representing entailment relations, and present a novel resource
of gold-standard entailment graphs based on a set of patterns
automatically acquired using distant supervision. We describe the
methodology used for creating the dataset and present statistics of
the resource as well as an analysis of inference types underlying
the entailment decisions.
Passing a USA National Bar Exam: a First Corpus for Experimentation
Biralatei Fawei, Adam Wyner and Jeff Pan
Bar exams provide a key watershed by which legal professionals
demonstrate their knowledge of the law and its application.
Passing the bar entitles one to practice the law in a given
jurisdiction. The bar provides an excellent benchmark for the
performance of legal information systems since passing the bar
would arguably signal that the system has acquired key aspects
of legal reasoning on a par with a human lawyer. The paper
provides a corpus and experimental results with material derived
from a real bar exam, treating the problem as a form of textual
entailment from the question to an answer. The providers of
the bar exam material set the Gold Standard, which is the
answer key. The experiments were carried out using the ‘out of the box’ Excitement Open Platform for textual entailment. The
results and evaluation show that the tool can identify wrong
answers (non-entailment) with a high F1 score, but it performs
poorly in identifying the correct answer (entailment). The results
provide a baseline performance measure against which to evaluate
future improvements. The reasons for the poor performance
are examined, and proposals are made to augment the tool
in the future. The corpus facilitates experimentation by other
researchers.
Corpora for Learning the Mutual Relationship between Semantic Relatedness and Textual Entailment
Ngoc Phuoc An Vo and Octavian Popescu
In this paper we present the creation of a corpus annotated with
both semantic relatedness (SR) scores and textual entailment (TE)
judgments. In building this corpus we aimed at discovering the relationship, if any, between these two tasks for the mutual benefit
of resolving one of them by relying on the insights gained from
the other. We considered corpora already annotated with TE judgments and proceeded to the manual annotation with SR scores. The RTE 1-4 corpora used in the PASCAL competition fit our need. The annotators worked independently of one another and did not have access to the TE judgments during
annotation. The intuition that the two annotations are correlated
received major support from this experiment and this finding led
to a system that uses this information to revise the initial estimates
of SR scores. As semantic relatedness is one of the most general and difficult tasks in natural language processing, we expect that
future systems will combine different sources of information in
order to solve it. Our work suggests that textual entailment plays
a quantifiable role in addressing it.
O34 - Document Classification, Text Categorisation and Topic Detection
Thursday, May 26, 18:20
Chairperson: Iryna Gurevych Oral Session
Can Topic Modelling benefit from Word Sense Information?
Adriana Ferrugento, Hugo Gonçalo Oliveira, Ana Alves and Filipe Rodrigues
This paper proposes a new topic model that exploits word
sense information in order to discover less redundant and more
informative topics. Word sense information is obtained from
WordNet and the discovered topics are groups of synsets, instead
of mere surface words. A key feature is that all the known senses
of a word are considered, with their probabilities. Alternative
configurations of the model are described and compared to each
other and to LDA, the most popular topic model. However, the
obtained results suggest that there are no benefits of enriching
LDA with word sense information.
Age and Gender Prediction on Health Forum Data
Prasha Shrestha, Nicolas Rey-Villamizar, Farig Sadeque, Ted Pedersen, Steven Bethard and Thamar Solorio
Health support forums have become a rich source of data that can
be used to improve health care outcomes. A user profile, including
information such as age and gender, can support targeted analysis
of forum data. But users might not always disclose their age and
gender. It is desirable then to be able to automatically extract
this information from users’ content. However, to the best of our
knowledge there is no such resource for author profiling of health
forum data. Here we present a large corpus, with close to 85,000
users, for profiling and also outline our approach and benchmark
results to automatically detect a user’s age and gender from their
forum posts. We use a mix of features from a user’s text as well as
forum specific features to obtain accuracy well above the baseline,
thus showing that both our dataset and our method are useful and
valid.
Comparing Speech and Text Classification on ICNALE
Sergiu Nisioi
In this paper we explore and compare a speech and text
classification approach on a corpus of native and non-native
English speakers. We experiment on a subset of the International
Corpus Network of Asian Learners of English containing the
recorded speeches and the equivalent text transcriptions. Our
results suggest a high correlation between the spoken and
written classification results, showing that native accent is highly
correlated with grammatical structures found in text.
O35 - Detecting Information in Medical Domain
Thursday, May 26, 18:20
Chairperson: Dimitrios Kokkinakis Oral Session
Monitoring Disease Outbreak Events on the Web Using Text-mining Approach and Domain Expert Knowledge
Elena Arsevska, Mathieu Roche, Sylvain Falala, Renaud Lancelot, David Chavernac, Pascal Hendrikx and Barbara Dufour
Timeliness and precision in detecting infectious animal disease outbreaks from information published on the web are crucial for preventing their spread. We propose a generic method
to enrich and extend the use of different expressions as queries
in order to improve the acquisition of relevant disease related
pages on the web. Our method combines a text mining approach
to extract terms from corpora of relevant disease outbreak
documents, and domain expert elicitation (Delphi method) to
propose expressions and to select relevant combinations between
terms obtained with text mining. In this paper we evaluated the performance, as queries, of a number of expressions obtained with text mining and validated by a domain expert, as well as of expressions proposed by a panel of 21 domain experts. We used African swine
fever as an infectious animal disease model. The expressions
obtained with text mining outperformed, as queries, the expressions proposed by domain experts. However, the domain experts proposed expressions that were not extracted automatically. Our method is simple to conduct and flexible enough to adapt to any other infectious animal disease, and even to the public health domain.
On Developing Resources for Patient-level Information Retrieval
Stephen Wu, Tamara Timmons, Amy Yates, Meikun Wang, Steven Bedrick, William Hersh and Hongfang Liu
Privacy concerns have often served as an insurmountable
barrier for the production of research and resources in clinical
information retrieval (IR). We believe that both clinical IR
research innovation and legitimate privacy concerns can be served
by the creation of intra-institutional, fully protected resources.
In this paper, we provide some principles and tools for IR
resource-building in the unique problem setting of patient-level
IR, following the tradition of the Cranfield paradigm.
Annotating and Detecting Medical Events in Clinical Notes
Prescott Klassen, Fei Xia and Meliha Yetisgen
Early detection and treatment of diseases, such as pneumonia, that onset after a patient is admitted to a hospital is critical to improving care and reducing costs in healthcare. Previous studies
(Tepper et al., 2013) showed that change-of-state events in
clinical notes could be important cues for phenotype detection.
In this paper, we extend the annotation schema proposed in
(Klassen et al., 2014) to mark change-of-state events, diagnosis
events, coordination, and negation. After completing the annotation, we build NLP systems to automatically identify named entities and medical events, which yield F-scores of 94.7% and 91.8%, respectively.
O36 - Speech Synthesis
Thursday, May 26, 18:20
Chairperson: Diana Santos Oral Session
Speech Synthesis of Code-Mixed Text
Sunayana Sitaram and Alan W Black
Most Text to Speech (TTS) systems today assume that the input
text is in a single language and is written in the same language
that the text needs to be synthesized in. However, in bilingual and
multilingual communities, code mixing or code switching occurs
in speech, in which speakers switch between languages in the
same utterance. Due to the popularity of social media, we now
see code-mixing even in text in these multilingual communities.
TTS systems capable of synthesizing such text need to be able
to handle text that is written in multiple languages and scripts.
Code-mixed text poses many challenges to TTS systems, such as
language identification, spelling normalization and pronunciation
modeling. In this work, we describe a preliminary framework
for synthesizing code-mixed text. We carry out experiments on
synthesizing code-mixed Hindi and English text. We find that
there is a significant user preference for TTS systems that can
correctly identify and pronounce words in different languages.
Chatbot Technology with Synthetic Voices in the Acquisition of an Endangered Language: Motivation, Development and Evaluation of a Platform for Irish
Neasa Ní Chiaráin and Ailbhe Ní Chasaide
This paper describes the development and evaluation of a chatbot
platform designed for the teaching/learning of Irish. The chatbot
uses synthetic voices developed for the dialects of Irish. Speech-
enabled chatbot technology offers a potentially powerful tool for
dealing with the challenges of teaching/learning an endangered
language where learners have limited access to native speaker
models of the language and limited exposure to the language in
a truly communicative setting. The sociolinguistic context that
motivates the present development is explained. The evaluation
of the chatbot was carried out in 13 schools by 228 pupils
and consisted of two parts. Firstly, learners’ opinions of the
overall chatbot platform as a learning environment were elicited.
Secondly, learners evaluated the intelligibility, quality, and
attractiveness of the synthetic voices used in this platform. Results
were overwhelmingly positive towards both the learning platform and the synthetic voices, and indicate that the time may now be ripe for
language learning applications which exploit speech and language
technologies. It is further argued that these technologies have a
particularly vital role to play in the maintenance of the endangered
language.
CHATR the Corpus; a 20-year-old archive of Concatenative Speech Synthesis
Nick Campbell
This paper reports the preservation of an old speech synthesis
website as a corpus. CHATR was a revolutionary technique
developed in the mid nineties for concatenative speech synthesis.
The method has since become the standard for high quality speech
output by computer, although much of the current research is
devoted to parametric or hybrid methods that employ smaller
amounts of data and can be more easily tunable to individual
voices. The system was first reported in 1994 and the website
was functional in 1996. The ATR labs where this system was
invented no longer exist, but the website has been preserved as
a corpus containing 1537 samples of synthesised speech from
that period (118 MB in aiff format) in 211 pages under various
finely interrelated themes. The corpus can be accessed from
www.speech-data.jp as well as www.tcd-fastnet.com, where the
original code and samples are now being maintained.
O37 - Robots and Conversational Agents Interaction
Friday, May 27, 9:45
Chairperson: Claude Barras Oral Session
How to Address Smart Homes with a Social Robot? A Multi-modal Corpus of User Interactions with an Intelligent Environment
Patrick Holthaus, Christian Leichsenring, Jasmin Bernotat, Viktor Richter, Marian Pohling, Birte Carlmeyer, Norman Köster, Sebastian Meyer zu Borgsen, René Zorn, Birte Schiffhauer, Kai Frederic Engelmann, Florian Lier, Simon Schulz, Philipp Cimiano, Friederike Eyssel, Thomas Hermann, Franz Kummert, David Schlangen, Sven Wachsmuth, Petra Wagner, Britta Wrede and Sebastian Wrede
In order to explore intuitive verbal and non-verbal interfaces
in smart environments we recorded user interactions with an
intelligent apartment. Besides offering various interactive
capabilities itself, the apartment is also inhabited by a social robot
that is available as a humanoid interface. This paper presents a
multi-modal corpus that contains goal-directed actions of naive
users in attempts to solve a number of predefined tasks. Alongside
audio and video recordings, our data-set consists of a large amount
of temporally aligned sensory data and system behavior provided
by the environment and its interactive components. Non-verbal
system responses such as changes in light or display contents, as
well as robot and apartment utterances and gestures serve as a
rich basis for later in-depth analysis. Manual annotations provide
further information about metadata such as the current course of the study, and about user behavior including the incorporated modality, all
literal utterances, language features, emotional expressions, foci
of attention, and addressees.
A Corpus of Gesture-Annotated Dialogues for Monologue-to-Dialogue Generation from Personal Narratives
Zhichao Hu, Michelle Dick, Chung-Ning Chang, Kevin Bowden, Michael Neff, Jean Fox Tree and Marilyn Walker
Story-telling is a fundamental and prevalent aspect of human
social behavior. In the wild, stories are told conversationally
in social settings, often as a dialogue and with accompanying
gestures and other nonverbal behavior. This paper presents
a new corpus, the Story Dialogue with Gestures (SDG)
corpus, consisting of 50 personal narratives regenerated as
dialogues, complete with annotations of gesture placement and
accompanying gesture forms. The corpus includes dialogues
generated by human annotators, gesture annotations on the human
generated dialogues, videos of story dialogues generated from this
representation, video clips of each gesture used in the gesture
annotations, and annotations of the original personal narratives
with a deep representation of story called a Story Intention Graph.
Our long term goal is the automatic generation of story co-tellings
as animated dialogues from the Story Intention Graph. We expect
this corpus to be a useful resource for researchers interested in
natural language generation, intelligent virtual agents, generation
of nonverbal behavior, and story and narrative representations.
Multimodal Resources for Human-Robot Communication Modelling
Stavroula-Evita Fotinea, Eleni Efthimiou, Maria Koutsombogera, Athanasia-Lida Dimou, Theodore Goulas and Kyriaki Vasilaki
This paper reports on work related to the modelling of
Human-Robot Communication on the basis of multimodal and
multisensory human behaviour analysis. A primary focus
in this framework of analysis is the definition of semantics
of human actions in interaction, their capture and their
representation in terms of behavioural patterns that, in turn, feed
a multimodal human-robot communication system. Semantic
analysis encompasses both oral and sign languages, as well as
both verbal and non-verbal communicative signals to achieve an
effective, natural interaction between elderly users with slight
walking and cognitive inability and an assistive robotic platform.
A Verbal and Gestural Corpus of Story Retellings to an Expressive Embodied Virtual Character
Jackson Tolins, Kris Liu, Michael Neff, Marilyn Walker and Jean Fox Tree
We present a corpus of 44 human-agent verbal and gestural story
retellings designed to explore whether humans would gesturally
entrain to an embodied intelligent virtual agent. We used a
novel data collection method where an agent presented story
components in installments, which the human would then retell
to the agent. At the end of the installments, the human would then
retell the embodied animated agent the story as a whole. This
method was designed to allow us to observe whether changes
in the agent’s gestural behavior would result in human gestural
changes. The agent modified its gestures over the course of the
story, by starting out the first installment with gestural behaviors
designed to manifest extraversion, and slowly modifying gestures
to express introversion over time, or the reverse. The corpus
contains the verbal and gestural transcripts of the human story
retellings. The gestures were coded for type, handedness,
temporal structure, spatial extent, and the degree to which the
participants’ gestures match those produced by the agent. The
corpus illustrates the variation in expressive behaviors produced
by users interacting with embodied virtual characters, and the
degree to which their gestures were influenced by the agent’s
dynamic changes in personality-based expressive style.
A Multimodal Motion-Captured Corpus of Matched and Mismatched Extravert-Introvert Conversational Pairs
Jackson Tolins, Kris Liu, Yingying Wang, Jean Fox Tree, Marilyn Walker and Michael Neff
This paper presents a new corpus, the Personality Dyads Corpus,
consisting of multimodal data for three conversations between
three personality-matched, two-person dyads (a total of 9 separate
dialogues). Participants were selected from a larger sample to
be 0.8 of a standard deviation above or below the mean on the
Big-Five Personality extraversion scale, to produce an Extravert-
Extravert dyad, an Introvert-Introvert dyad, and an Extravert-
Introvert dyad. Each pair carried out conversations for three
different tasks. The conversations were recorded using optical
motion capture for the body and data gloves for the hands. Dyads’
speech was transcribed and the gestural and postural behavior was
annotated with ANVIL. The released corpus includes personality
profiles, ANVIL files containing speech transcriptions and the
gestural annotations, and BVH files containing body and hand
motion in 3D.
O38 - Crowdsourcing
Friday, May 27, 9:45
Chairperson: Andrejs Vasiljevs Oral Session
Crowdsourcing Ontology Lexicons
Bettina Lanser, Christina Unger and Philipp Cimiano
In order to make the growing amount of conceptual knowledge
available through ontologies and datasets accessible to humans,
NLP applications need access to information on how this
knowledge can be verbalized in natural language. One way to
provide this kind of information are ontology lexicons, which
apart from the actual verbalizations in a given target language can
provide further, rich linguistic information about them. Compiling
such lexicons manually is a very time-consuming task and
requires expertise both in Semantic Web technologies and lexicon
engineering, as well as a very good knowledge of the target
language at hand. In this paper we present an alternative approach
to generating ontology lexicons by means of crowdsourcing: We
use CrowdFlower to generate a small Japanese ontology lexicon
for ten exemplary ontology elements from the DBpedia ontology
according to a two-stage workflow, the main underlying idea of
which is to turn the task of generating lexicon entries into a
translation task; the starting point of this translation task is a
manually created English lexicon for DBpedia. Comparison of
the results to a manually created Japanese lexicon shows that the
presented workflow is a viable option if an English seed lexicon is
already available.
InScript: Narrative Texts Annotated with Script Information
Ashutosh Modi, Tatjana Anikina, Simon Ostermann and Manfred Pinkal
This paper presents the InScript corpus (Narrative Texts
Instantiating Script structure). InScript is a corpus of 1,000 stories
centered around 10 different scenarios. Verbs and noun phrases
are annotated with event and participant types, respectively.
Additionally, the text is annotated with coreference information.
The corpus shows rich lexical variation and will serve as a unique
resource for the study of the role of script knowledge in natural
language processing.
A Crowdsourced Database of Event Sequence Descriptions for the Acquisition of High-quality Script Knowledge
Lilian D. A. Wanzare, Alessandra Zarcone, Stefan Thater and Manfred Pinkal
Scripts are standardized event sequences describing typical
everyday activities, which play an important role in the
computational modeling of cognitive abilities (in particular
for natural language processing). We present a large-scale
crowdsourced collection of explicit linguistic descriptions of
script-specific event sequences (40 scenarios with 100 sequences
each). The corpus is enriched with crowdsourced alignment
annotation on a subset of the event descriptions, to be used
in future work as seed data for automatic alignment of event
descriptions (for example via clustering). The event descriptions
to be aligned were chosen among those expected to have the
strongest corrective effect on the clustering algorithm. The
alignment annotation was evaluated against a gold standard of
expert annotators. The resulting database of partially-aligned
script-event descriptions provides a sound empirical basis for
inducing high-quality script knowledge, as well as for any task
involving alignment and paraphrase detection of events.
Temporal Information Annotation: Crowd vs. Experts
Tommaso Caselli, Rachele Sprugnoli and Oana Inel
This paper describes two sets of crowdsourcing experiments on
temporal information annotation conducted on two languages,
i.e., English and Italian. The first experiment, launched on
the CrowdFlower platform, was aimed at classifying temporal
relations given target entities. The second one, relying on the
CrowdTruth metric, consisted of two subtasks: one devoted to
the recognition of events and temporal expressions and one to the
detection and classification of temporal relations. The outcomes
of the experiments suggest a valuable use of crowdsourcing
annotations also for a complex task like Temporal Processing.
A Tangled Web: The Faint Signals of Deception in Text - Boulder Lies and Truth Corpus (BLT-C)
Franco Salvetti, John B. Lowe and James H. Martin
We present an approach to creating corpora for use in detecting
deception in text, including a discussion of the challenges peculiar
to this task. Our approach is based on soliciting several types
of reviews from writers and was implemented using Amazon
Mechanical Turk. We describe the multi-dimensional corpus
of reviews built using this approach, available free of charge
from LDC as the Boulder Lies and Truth Corpus (BLT-C).
Challenges for both corpus creation and the deception detection
include the fact that human performance on the task is typically
at chance, that the signal is faint, that paid writers such as
turkers are sometimes deceptive, and that deception is a complex
human behavior; manifestations of deception depend on details of
domain, intrinsic properties of the deceiver (such as education,
linguistic competence, and the nature of the intention), and
specifics of the deceptive act (e.g., lying vs. fabricating). To
overcome the inherent lack of ground truth, we have developed
a set of semi-automatic techniques to ensure corpus validity. We
present some preliminary results on the task of deception detection
which suggest that the BLT-C is an improvement in the quality of
resources available for this task.
O39 - Corpora for Machine Translation
Friday, May 27, 9:45
Chairperson: Christopher Cieri Oral Session
Finding Alternative Translations in a Large Corpus of Movie Subtitles
Jörg Tiedemann
OpenSubtitles.org provides a large collection of user contributed
subtitles in various languages for movies and TV programs.
Subtitle translations are valuable resources for cross-lingual
studies and machine translation research. A less explored feature
of the collection is the inclusion of alternative translations, which
can be very useful for training paraphrase systems or collecting
multi-reference test suites for machine translation. However,
differences in translation may also be due to misspellings,
incomplete or corrupt data files, or wrongly aligned subtitles. This
paper reports our efforts in recognising and classifying alternative
subtitle translations with language independent techniques.
We use time-based alignment with lexical re-synchronisation
techniques and BLEU score filters and sort alternative translations
into categories using edit distance metrics and heuristic rules. Our
approach produces large numbers of sentence-aligned translation
alternatives for over 50 languages provided via the OPUS corpus
collection.
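The edit-distance sorting mentioned above can be illustrated with a minimal Levenshtein-distance sketch (a generic illustration only, not the authors' implementation; the example subtitle strings are invented):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Two alternative subtitle lines: a very small distance suggests a
# spelling variant rather than a genuinely different translation.
print(levenshtein("He left yesterday.", "He left yesterdey."))  # 1
```

A filter of this kind can separate near-duplicates from true translation alternatives by thresholding the distance relative to string length.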
Exploiting a Large Strongly Comparable Corpus
Thierry Etchegoyhen, Andoni Azpeitia and Naiara Pérez
This article describes a large comparable corpus for Basque
and Spanish and the methods employed to build a parallel
resource from the original data. The EITB corpus, a strongly
comparable corpus in the news domain, is to be shared with the
research community, as an aid for the development and testing
of methods in comparable corpora exploitation, and as a basis for
the improvement of data-driven machine translation systems for
this language pair. Competing approaches were explored for the
alignment of comparable segments in the corpus, resulting in
the design of a simple method which outperformed a state-of-
the-art method on the corpus test sets. The method we present
is highly portable, computationally efficient, and significantly
reduces deployment work, a welcome result for the exploitation
of comparable corpora.
The United Nations Parallel Corpus v1.0
Michał Ziemski, Marcin Junczys-Dowmunt and Bruno Pouliquen
This paper describes the creation process and statistics of the
official United Nations Parallel Corpus, the first parallel corpus
composed from United Nations documents published by the
original data creator. The parallel corpus presented consists of
manually translated UN documents from the last 25 years (1990 to
2014) for the six official UN languages, Arabic, Chinese, English,
French, Russian, and Spanish. The corpus is freely available
for download under a liberal license. Apart from the pairwise
aligned documents, a fully aligned subcorpus for the six official
UN languages is distributed. We provide baseline BLEU scores
of our Moses-based SMT systems trained with the full data of
language pairs involving English and for all possible translation
directions of the six-way subcorpus.
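As a rough illustration of how such baseline scores are computed, here is a simplified single-sentence, single-reference BLEU sketch with add-one smoothing (not the Moses scorer, which works at corpus level; the example sentences are invented):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Single-sentence, single-reference BLEU with clipped n-gram counts."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(count, r[g]) for g, count in c.items())
        total = max(sum(c.values()), 1)
        # add-one smoothing avoids log(0) on short sentences
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 2))  # 1.0
```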
WAGS: A Beautiful English-Italian Benchmark Supporting Word Alignment Evaluation on Rare Words
Luisa Bentivogli, Mauro Cettolo, M. Amin Farajian and Marcello Federico
This paper presents WAGS (Word Alignment Gold Standard),
a novel benchmark which allows extensive evaluation of WA
tools on out-of-vocabulary (OOV) and rare words. WAGS is
a subset of the Common Test section of the Europarl English-
Italian parallel corpus, and is specifically tailored to OOV and rare
words. WAGS is composed of 6,715 sentence pairs containing
11,958 occurrences of OOV and rare words up to frequency 15 in
the Europarl Training set (5,080 English words and 6,878 Italian
words), representing almost 3% of the whole text. Since WAGS
is focused on OOV/rare words, manual alignments are provided
for these words only, and not for the whole sentences. Two off-
the-shelf word aligners have been evaluated on WAGS, and results
have been compared to those obtained on an existing benchmark
tailored to full text alignment. The results obtained confirm that
WAGS is a valuable resource, which allows a statistically sound
evaluation of WA systems’ performance on OOV and rare words,
as well as extensive data analyses. WAGS is publicly released
under a Creative Commons Attribution license.
Manual and Automatic Paraphrases for MT Evaluation
Aleš Tamchyna and Petra Barancikova
Paraphrasing of reference translations has been shown to improve
the correlation with human judgements in automatic evaluation
of machine translation (MT) outputs. In this work, we present
a new dataset for evaluating English-Czech translation based
on automatic paraphrases. We compare this dataset with an
existing set of manually created paraphrases and find that even
automatic paraphrases can improve MT evaluation. We also
propose and evaluate several criteria for selecting suitable
reference translations from a larger set.
O40 - Treebanks and Syntactic and Semantic Analysis
Friday, May 27, 9:45
Chairperson: Joakim Nivre Oral Session
Poly-GrETEL: Cross-Lingual Example-based Querying of Syntactic Constructions
Liesbeth Augustinus, Vincent Vandeghinste and Tom Vanallemeersch
We present Poly-GrETEL, an online tool which enables syntactic
querying in parallel treebanks, based on the monolingual GrETEL
environment. We provide online access to the Europarl parallel
treebank for Dutch and English, allowing users to query the
treebank using either an XPath expression or an example sentence
in order to look for similar constructions. We provide automatic
alignments between the nodes. By combining example-based
query functionality with node alignments, we limit the need for
users to be familiar with the query language and the structure of
the trees in the source and target language, thus facilitating the
use of parallel corpora for comparative linguistics and translation
studies.
NorGramBank: A ‘Deep’ Treebank for Norwegian
Helge Dyvik, Paul Meurer, Victoria Rosén, Koenraad De Smedt, Petter Haugereid, Gyri Smørdal Losnegaard, Gunn Inger Lyse and Martha Thunes
We present NorGramBank, a treebank for Norwegian with highly
detailed LFG analyses. It is one of many treebanks made available
through the INESS treebanking infrastructure. NorGramBank
was constructed as a parsebank, i.e. by automatically parsing a
corpus, using the wide coverage grammar NorGram. One part
consisting of 350,000 words has been manually disambiguated
using computer-generated discriminants. A larger part of 50
M words has been stochastically disambiguated. The treebank
is dynamic: by global reparsing at certain intervals it is kept
compatible with the latest versions of the grammar and the
lexicon, which are continually further developed in interaction
with the annotators. A powerful query language, INESS Search,
has been developed for search across formalisms in the INESS
treebanks, including LFG c- and f-structures. Evaluation shows
that the grammar provides about 85% of randomly selected
sentences with good analyses. Agreement among the annotators
responsible for manual disambiguation is satisfactory, but also
suggests desirable simplifications of the grammar.
Accurate Deep Syntactic Parsing of Graphs: The Case of French
Corentin Ribeyre, Eric Villemonte de la Clergerie and Djamé Seddah
Parsing predicate-argument structures in a deep syntax framework
requires graphs to be predicted. Argument structures represent
a higher level of abstraction than the syntactic ones and are
thus more difficult to predict even for highly accurate parsing
models on surfacic syntax. In this paper we investigate deep
syntax parsing, using a French data set (Ribeyre et al., 2014a).
We demonstrate that the use of topologically different types of
syntactic features, such as dependencies, tree fragments, spines
or syntactic paths, brings a much needed context to the parser.
Our higher-order parsing model, gaining thus up to 4 points,
establishes the state of the art for parsing French deep syntactic
structures.
Explicit Fine-grained Syntactic and Semantic Annotation of the Idafa Construction in Arabic
Abdelati Hawwari, Mohammed Attia, Mahmoud Ghoneim and Mona Diab
Idafa in traditional Arabic grammar is an umbrella construction
that covers several phenomena including what is expressed
in English as noun-noun compounds and Saxon and Norman
genitives. Additionally, Idafa participates in some other
constructions, such as quantifiers, quasi-prepositions, and
adjectives. Identifying the various types of the Idafa construction
(IC) is of importance to Natural Language processing (NLP)
applications. Noun-Noun compounds exhibit special behavior in
most languages impacting their semantic interpretation. Hence
distinguishing them could have an impact on downstream NLP
applications. The most comprehensive syntactic representation
of the Arabic language is the LDC Arabic Treebank (ATB). In
the ATB, ICs are not explicitly labeled and furthermore, there
is no distinction between ICs of noun-noun relations and other
traditional ICs. Hence, we devise a detailed syntactic and
semantic typification process of the IC phenomenon in Arabic.
We target the ATB as a platform for this classification. We render
the ATB annotated with explicit IC labels but with the further
semantic characterization which is useful for syntactic, semantic
and cross language processing. Our typification of IC comprises
3 main syntactic IC types: FIC, GIC, and TIC, and they are
further divided into 10 syntactic subclasses. The TIC group is
further classified into semantic relations. We devise a method for
automatic IC labeling and compare its yield against the CATiB
treebank. Our evaluation shows that we achieve the same level
of accuracy, but with the additional fine-grained classification into
the various syntactic and semantic types.
P44 - Corpus Creation and Querying (1)
Friday, May 27, 9:45
Chairperson: Cristina Bosco Poster Session
Compasses, Magnets, Water Microscopes: Annotation of Terminology in a Diachronic Corpus of Scientific Texts
Anne-Kathrin Schumann and Stefan Fischer
The specialised lexicon belongs to the most prominent attributes
of specialised writing: Terms function as semantically dense
encodings of specialised concepts, which, in the absence of terms,
would require lengthy explanations and descriptions. In this paper,
we argue that terms are the result of diachronic processes on
both the semantic and the morpho-syntactic level. Very little
is known about these processes. We therefore present a corpus
annotation project aiming at revealing how terms are coined
and how they evolve to fit their function as semantically and
morpho-syntactically dense encodings of specialised knowledge.
The scope of this paper is two-fold: Firstly, we outline our
methodology for annotating terminology in a diachronic corpus of
scientific publications. Moreover, we provide a detailed analysis
of our annotation results and suggest methods for improving the
accuracy of annotations in a setting as difficult as ours. Secondly,
we present results of a pilot study based on the annotated terms.
The results suggest that terms in older texts are linguistically
relatively simple units that are hard to distinguish from the lexicon
of general language. We believe that this supports our hypothesis
that terminology undergoes diachronic processes of densification
and specialisation.
KorAP Architecture – Diving in the Deep Sea of Corpus Data
Nils Diewald, Michael Hanl, Eliza Margaretha, Joachim Bingel, Marc Kupietz, Piotr Banski and Andreas Witt
KorAP is a corpus search and analysis platform, developed at the
Institute for the German Language (IDS). It supports very large
corpora with multiple annotation layers, multiple query languages,
and complex licensing scenarios. KorAP’s design aims to be
scalable, flexible, and sustainable to serve the German Reference
Corpus DeReKo for at least the next decade. To meet these
requirements, we have adopted a highly modular microservice-
based architecture. This paper outlines our approach: An
architecture consisting of small components that are easy to
extend, replace, and maintain. The components include a search
backend, a user and corpus license management system, and a
web-based user frontend. We also describe a general corpus query
protocol used by all microservices for internal communications.
KorAP is open source, licensed under BSD-2, and available on
GitHub.
Text Segmentation of Digitized Clinical Texts
Cyril Grouin
In this paper, we present the experiments we made to recover
the original page layout structure into two columns from layout
damaged digitized files. We designed several CRF-based
approaches, either to identify the column separator or to classify each
token from each line into left or right columns. We achieved our
best results with a model trained on homogeneous corpora (only
files composed of 2 columns) when classifying each token into left
or right columns (overall F-measure of 0.968). Our experiments
show it is possible to recover the original column layout of
digitized documents with good quality.
A Turkish Database for Psycholinguistic Studies Based on Frequency, Age of Acquisition, and Imageability
Elif Ahsen Acar, Deniz Zeyrek, Murathan Kurfalı and Cem Bozsahin
This study primarily aims to build a Turkish psycholinguistic
database including three variables: word frequency, age of
acquisition (AoA), and imageability, where AoA and imageability
information are limited to nouns. We used a corpus-based
approach to obtain information about the AoA variable. We built
two corpora: a child literature corpus (CLC) including 535 books
written for 3-12 years old children, and a corpus of transcribed
children’s speech (CSC) at ages 1;4-4;8. A comparison between
the word frequencies of CLC and CSC gave positive correlation
results, suggesting the usability of the CLC to extract AoA
information. We assumed that frequent words of the CLC would
correspond to early acquired words whereas frequent words of a
corpus of adult language would correspond to late acquired words.
To validate AoA results from our corpus-based approach, a rated
AoA questionnaire was conducted on adults. Imageability values
were collected via a different questionnaire conducted on adults.
We conclude that it is possible to deduce AoA information for
high frequency words with the corpus-based approach. The results
about low frequency words were inconclusive, which is attributed
to the fact that corpus-based AoA information is affected by the
strong negative correlation between corpus frequency and rated
AoA.
Domain-Specific Corpus Expansion with Focused Webcrawling
Steffen Remus and Chris Biemann
This work presents a straightforward method for extending or
creating in-domain web corpora by focused webcrawling. The
focused webcrawler uses statistical N-gram language models to
estimate the relatedness of documents and weblinks and needs
as input only N-grams or plain texts of a predefined domain
and seed URLs as starting points. Two experiments demonstrate
that our focused crawler is able to stay focused in domain
and language. The first experiment shows that the crawler
stays in a focused domain, the second experiment demonstrates
that language models trained on focused crawls obtain better
perplexity scores on in-domain corpora. We distribute the focused
crawler as open source software.
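The perplexity criterion can be illustrated with an add-alpha smoothed unigram model (a simplified sketch only; the authors use higher-order N-gram models, and the toy corpora below are invented):

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens, alpha=1.0):
    """Perplexity of an add-alpha smoothed unigram model on test data."""
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    denom = sum(counts.values()) + alpha * len(vocab)
    log_prob = sum(math.log((counts[t] + alpha) / denom) for t in test_tokens)
    return math.exp(-log_prob / len(test_tokens))

domain = "gene protein cell gene expression protein".split()
in_domain_text = "gene protein".split()
off_domain_text = "football goal".split()
# In-domain text receives a lower (better) perplexity than off-domain text,
# which is what lets a focused crawler rank candidate pages.
print(unigram_perplexity(domain, in_domain_text)
      < unigram_perplexity(domain, off_domain_text))  # True
```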
Corpus-Based Diacritic Restoration for South Slavic Languages
Nikola Ljubešic, Tomaž Erjavec and Darja Fišer
In computer-mediated communication, Latin-based scripts users
often omit diacritics when writing. Such text is typically easily
understandable to humans but very difficult for computational
processing because many words become ambiguous or unknown.
Letter-level approaches to diacritic restoration generalise better
and do not require a lot of training data but word-level approaches
tend to yield better results. However, they typically rely on a
lexicon which is an expensive resource, not covering non-standard
forms, and often not available for less-resourced languages. In
this paper we present diacritic restoration models that are trained
on easy-to-acquire corpora. We test three different types of
corpora (Wikipedia, general web, Twitter) for three South Slavic
languages (Croatian, Serbian and Slovene) and evaluate them
on two types of text: standard (Wikipedia) and non-standard
(Twitter). The proposed approach considerably outperforms
charlifter, so far the only open source tool available for this task.
We make the best performing systems freely available.
Automatic Recognition of Linguistic Replacements in Text Series Generated from Keystroke Logs
Daniel Couto-Vale, Stella Neumann and Paula Niemietz
This paper introduces a toolkit used for the purpose of
detecting replacements of different grammatical and semantic
structures in ongoing text production logged as a chronological
series of computer interaction events (so-called keystroke logs).
The specific case we use involves human translations where
replacements can be indicative of translator behaviour that leads
to specific features of translations that distinguish them from
non-translated texts. The toolkit uses a novel CCG chart parser
customised so as to recognise grammatical words independently
of space and punctuation boundaries. On the basis of the
linguistic analysis, structures in different versions of the target
text are compared and classified as potential equivalents of the
same source text segment by ‘equivalence judges’. In that way,
replacements of grammatical and semantic structures can be
detected. Beyond the specific task at hand the approach will also
be useful for the analysis of other types of spaceless text such as
Twitter hashtags and texts in agglutinative or spaceless languages
like Finnish or Chinese.
Automatic Corpus Extension for Data-driven Natural Language Generation
Elena Manishina, Bassam Jabaian, Stéphane Huet and Fabrice Lefevre
As data-driven approaches started to make their way into the
Natural Language Generation (NLG) domain, the need for
automation of corpus building and extension became apparent.
Corpus creation and extension in data-driven NLG domain
traditionally involved manual paraphrasing performed by either
a group of experts or with resort to crowd-sourcing. Building the
training corpora manually is a costly enterprise which requires a
lot of time and human resources. We propose to automate the
process of corpus extension by integrating automatically obtained
synonyms and paraphrases. Our methodology allowed us to
significantly increase the size of the training corpus and its level
of variability (the number of distinct tokens and specific syntactic
structures). Our extension solutions are fully automatic and
require only some initial validation. The human evaluation results
confirm that in many cases native speakers favor the outputs of the
model built on the extended corpus.
Bilbo-Val: Automatic Identification of Bibliographical Zone in Papers
Amal Htait, Sebastien Fournier and Patrice Bellot
In this paper, we present the automatic annotation of the
bibliographical references' zone in papers and articles in
XML/TEI format. Our work proceeds in two phases: first,
we use machine learning technology to classify bibliographical
and non-bibliographical paragraphs in papers, by means of a
model that was initially created to differentiate between the
footnotes containing or not containing bibliographical references.
This classification is one of the features of BILBO, an
open-source tool for the automatic annotation of bibliographic
references. We also suggest methods to minimize the margin
of error. Second, we propose an algorithm to find the largest list of
bibliographical references in the article. The improvement applied
to our model results in an increase in its efficiency, reaching an
accuracy of 85.89. In testing our work, we achieve an average
success rate of 72.23% in detecting the bibliographical
references' zone.
Large Scale Arabic Diacritized Corpus: Guidelines and Framework
Wajdi Zaghouani, Houda Bouamor, Abdelati Hawwari, Mona Diab, Ossama Obeid, Mahmoud Ghoneim, Sawsan Alqahtani and Kemal Oflazer
This paper presents the annotation guidelines developed as part
of an effort to create a large scale manually diacritized corpus
for various Arabic text genres. The target size of the annotated
corpus is 2 million words. We summarize the guidelines and
describe issues encountered during the training of the annotators.
We also discuss the challenges posed by the complexity of the
Arabic language and how they are addressed. Finally, we present
the diacritization annotation procedure and detail the quality of the
resulting annotations.
P45 - Evaluation Methodologies (3)
Friday, May 27, 9:45
Chairperson: Marta Villegas Poster Session
Applying the Cognitive Machine Translation Evaluation Approach to Arabic
Irina Temnikova, Wajdi Zaghouani, Stephan Vogel and Nizar Habash
The goal of the cognitive machine translation (MT) evaluation
approach is to build classifiers which assign post-editing effort
scores to new texts. The approach helps estimate fair
compensation for post-editors in the translation industry by
evaluating the cognitive difficulty of post-editing MT output.
The approach counts the number of errors classified in different
categories on the basis of how much cognitive effort they require
in order to be corrected. In this paper, we present the results
of applying an existing cognitive evaluation approach to Modern
Standard Arabic (MSA). We provide a comparison of the number
of errors and categories of errors in three MSA texts of different
MT quality (without any language-specific adaptation), as well
as a comparison between MSA texts and texts from three Indo-
European languages (Russian, Spanish, and Bulgarian), taken
from a previous experiment. The results show how the error
distributions change passing from the MSA texts of worse MT
quality to MSA texts of better MT quality, as well as a similarity
in distinguishing the texts of better MT quality for all four
languages.
A Reading Comprehension Corpus for Machine Translation Evaluation
Carolina Scarton and Lucia Specia
Effectively assessing Natural Language Processing output tasks
is a challenge for research in the area. In the case of Machine
Translation (MT), automatic metrics are usually preferred over
human evaluation, given time and budget constraints. However,
traditional automatic metrics (such as BLEU) are not reliable for
absolute quality assessment of documents, often producing similar
scores for documents translated by the same MT system. For
scenarios where absolute labels are necessary for building models,
such as document-level Quality Estimation, these metrics cannot
be fully trusted. In this paper, we introduce a corpus of reading
comprehension tests based on machine translated documents,
where we evaluate documents based on answers to questions by
fluent speakers of the target language. We describe the process
of creating such a resource, the experiment design and agreement
between the test takers. Finally, we discuss ways to convert the
reading comprehension test into document-level quality scores.
B2SG: a TOEFL-like Task for Portuguese
Rodrigo Wilkens, Leonardo Zilio, Eduardo Ferreira and Aline Villavicencio
Resources such as WordNet are useful for NLP applications,
but their manual construction consumes time and personnel,
and frequently results in low coverage. One alternative is
the automatic construction of large resources from corpora like
distributional thesauri, containing semantically associated words.
However, as they may contain noise, there is a strong need
for automatic ways of evaluating the quality of the resulting
resource. This paper introduces a gold standard that can aid in this
task. The BabelNet-Based Semantic Gold Standard (B2SG) was
automatically constructed based on BabelNet and partly evaluated
by human judges. It consists of sets of tests that present one target
word, one related word and three unrelated words. B2SG contains
2,875 validated relations: 800 for verbs and 2,075 for nouns; these
relations are divided among synonymy, antonymy and hypernymy.
They can be used as the basis for evaluating the accuracy of the
similarity relations on distributional thesauri by comparing the
proximity of the target word with the related and unrelated options
and observing if the related word has the highest similarity value
among them. As a case study two distributional thesauri were
also developed: one using surface forms from a large (1.5 billion
word) corpus and the other using lemmatized forms from a smaller
(409 million word) corpus. Both distributional thesauri were then
evaluated against B2SG, and the one using lemmatized forms
performed slightly better.
MoBiL: A Hybrid Feature Set for Automatic Human Translation Quality Assessment
Yu Yuan, Serge Sharoff and Bogdan Babych
In this paper we introduce MoBiL, a hybrid Monolingual,
Bilingual and Language modelling feature set and feature
selection and evaluation framework. The set includes translation
quality indicators that can be utilized to automatically predict
the quality of human translations in terms of content adequacy
and language fluency. We compare MoBiL with the QuEst
baseline set by using them in classifiers trained with support
vector machine and relevance vector machine learning algorithms
on the same data set. We also report an experiment on feature
selection to opt for fewer but more informative features from
MoBiL. Our experiments show that classifiers trained on our
feature set perform consistently better in predicting both adequacy
and fluency than the classifiers trained on the baseline feature set.
MoBiL also performs well when used with both support vector
machine and relevance vector machine algorithms.
MARMOT: A Toolkit for Translation Quality Estimation at the Word Level
Varvara Logacheva, Chris Hokamp and Lucia Specia
We present Marmot – a new toolkit for quality estimation (QE)
of machine translation output. Marmot contains utilities targeted
at quality estimation at the word and phrase level. However,
due to its flexibility and modularity, it can also be extended to
work at the sentence level. In addition, it can be used as a
framework for extracting features and learning models for many
common natural language processing tasks. The tool has a set
of state-of-the-art features for QE, and new features can easily
be added. The tool is open-source and can be downloaded from
https://github.com/qe-team/marmot/
RankDCG: Rank-Ordering Evaluation Measure
Denys Katerenchuk and Andrew Rosenberg
Ranking is used for a wide array of problems, most notably
information retrieval (search). Kendall’s τ, Average Precision,
and nDCG are a few popular approaches to the evaluation of
ranking. When dealing with problems such as user ranking or
recommendation systems, all these measures suffer from various
problems, including the inability to deal with elements of the
same rank, inconsistent and ambiguous lower bound scores, and
an inappropriate cost function. We propose a new measure, a
modification of the popular nDCG algorithm, named rankDCG,
that addresses these problems. We provide a number of criteria
for any effective ranking algorithm and show that only rankDCG
satisfies them all. Results are presented on constructed and real
data sets. We release a publicly available rankDCG evaluation
package.
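The rankDCG formula itself is not given in the abstract; as background, the standard nDCG that it modifies can be sketched as follows (a generic implementation, not the authors' code).

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: items further down the ranking
    # contribute less, via a logarithmic position discount.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize DCG by the ideal (sorted) ordering so scores fall in [0, 1]."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

print(round(ndcg([3, 2, 3, 0, 1]), 3))
```

Normalizing by the ideal ordering is what ties the score range to the particular relevance distribution, which relates to the lower-bound issues the authors discuss.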
Spanish Word Vectors from Wikipedia
Mathias Etcheverry and Dina Wonsever
Content analysis from text data requires semantic representations
that are difficult to obtain automatically, as they may require large
handcrafted knowledge bases or manually annotated examples.
Unsupervised autonomous methods for generating semantic
representations are of the greatest interest in the face of huge volumes
of text to be exploited in all kinds of applications. In this work we
describe the generation and validation of semantic representations
in the vector space paradigm for Spanish. The method used is
GloVe (Pennington, 2014), one of the best performing reported
methods, and vectors were trained over Spanish Wikipedia. The
learned vectors are evaluated on word analogy and similarity
tasks (Pennington, 2014; Baroni, 2014; Mikolov, 2013a). The
vector set and a Spanish version of some widely
used semantic relatedness tests are made publicly available.
P46 - Information Extraction and Retrieval (3)
Friday, May 27, 9:45
Chairperson: Aurelie Neveol Poster Session
Analyzing Pre-processing Settings for Urdu Single-document Extractive Summarization
Muhammad Humayoun and Hwanjo Yu
Preprocessing is a preliminary step in many fields including IR
and NLP. The effect of basic preprocessing settings on English
text summarization is well-studied. However, to the best of our
knowledge, no such effort exists for the Urdu language. In this
study, we analyze the effect of basic preprocessing settings on
single-document text summarization for Urdu, on a benchmark
corpus, using various experiments.
The analysis is performed using the state-of-the-art algorithms
for extractive summarization and the effect of stopword removal,
lemmatization, and stemming is analyzed. Results show that
these preprocessing settings improve summarization quality.
Semantic Annotation of the ACL Anthology Corpus for the Automatic Analysis of Scientific Literature
Kata Gábor, Haifa Zargayouna, Davide Buscaldi, Isabelle Tellier and Thierry Charnois
This paper describes the process of creating a corpus annotated for
concepts and semantic relations in the scientific domain. A part
of the ACL Anthology Corpus was selected for annotation, but
the annotation process itself is not specific to the computational
linguistics domain and could be applied to any scientific corpora.
Concepts were identified and annotated fully automatically,
based on a combination of terminology extraction and available
ontological resources. A typology of semantic relations between
concepts is also proposed. This typology, consisting of 18
domain-specific and 3 generic relations, is the result of a corpus-
based investigation of the text sequences occurring between
concepts in sentences. A sample of 500 abstracts from the
corpus is currently being manually annotated with these semantic
relations. Only explicit relations are taken into account, so that
the data could serve to train or evaluate pattern-based semantic
relation classification systems.
GATE-Time: Extraction of Temporal Expressions and Events
Leon Derczynski, Jannik Strötgen, Diana Maynard, Mark A. Greenwood and Manuel Jung
GATE is a widely used open-source solution for text processing
with a large user community. It contains components for
several natural language processing tasks. However, temporal
information extraction functionality within GATE has been rather
limited so far, despite being a prerequisite for many application
scenarios in the areas of natural language processing and
information retrieval. This paper presents an integrated approach
to temporal information processing. We take state-of-the-art
tools in temporal expression and event recognition and bring
them together to form an openly-available resource within the
GATE infrastructure. GATE-Time provides annotation in the
form of TimeML events and temporal expressions complying with
this mature ISO standard for temporal semantic annotation of
documents. Major advantages of GATE-Time are (i) that it relies
on HeidelTime for temporal tagging, so that temporal expressions
can be extracted and normalized in multiple languages and across
different domains, (ii) that it includes a modern, fast event recognition
and classification tool, and (iii) that it can be combined with
different linguistic pre-processing annotations, and is thus not
bound to license restricted preprocessing components.
Distributional Thesauri for Information Retrieval and vice versa
Vincent Claveau and Ewa Kijak
Distributional thesauri are useful in many tasks of Natural
Language Processing. In this paper, we address the problem of
building and evaluating such thesauri with the help of Information
Retrieval (IR) concepts. Two main contributions are proposed.
First, following the work of [8], we show how IR tools and
concepts can be used with success to build a thesaurus. Through
several experiments and by evaluating directly the results with
reference lexicons, we show that some IR models outperform
state-of-the-art systems. Secondly, we use IR as an applicative
framework to indirectly evaluate the generated thesaurus. Here
again, this task-based evaluation validates the IR approach used
to build the thesaurus. Moreover, it allows us to compare these
results with those from the direct evaluation framework used in the
literature. The observed differences bring these evaluation habits
into question.
Parallel Chinese-English Entities, Relations and Events Corpora
Justin Mott, Ann Bies, Zhiyi Song and Stephanie Strassel
This paper introduces the parallel Chinese-English Entities,
Relations and Events (ERE) corpora developed by Linguistic Data
Consortium under the DARPA Deep Exploration and Filtering of
Text (DEFT) Program. Original Chinese newswire and discussion
forum documents are annotated for two versions of the ERE task.
The texts are manually translated into English and then annotated
for the same ERE tasks on the English translation, resulting in
a rich parallel resource that has utility for performers within
the DEFT program, for participants in NIST’s Knowledge Base
Population evaluations, and for cross-language projection research
more generally.
The PsyMine Corpus - A Corpus annotated with Psychiatric Disorders and their Etiological Factors
Tilia Ellendorff, Simon Foster and Fabio Rinaldi
We present the first version of a corpus annotated for psychiatric
disorders and their etiological factors. The paper describes the
choice of text, annotated entities and events/relations as well
as the annotation scheme and procedure applied. The corpus
is featuring a selection of focus psychiatric disorders including
depressive disorder, anxiety disorder, obsessive-compulsive
disorder, phobic disorders and panic disorder. Etiological factors
for these focus disorders are widespread and include genetic,
physiological, sociological and environmental factors among
others. Etiological events, including annotated evidence text,
represent the interactions between the focus disorders and their
etiological factors. In addition to these core events, symptomatic
and treatment events have been annotated. The current version
of the corpus includes 175 scientific abstracts. All entities
and events/relations have been manually annotated by domain
experts and scores of inter-annotator agreement are presented.
The aim of the corpus is to provide a first gold standard to
support the development of biomedical text mining applications
for the specific area of mental disorders, which are among the
main contributors to the contemporary burden of disease.
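The abstract reports inter-annotator agreement scores without naming the measure; Cohen's kappa is a common choice for this kind of entity/event annotation, so the following is a generic sketch offered for orientation, not the authors' actual computation.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' parallel label sequences:
    chance-corrected agreement, (p_o - p_e) / (1 - p_e)."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Expected chance agreement from each annotator's label distribution.
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

print(round(cohens_kappa(list("AABBA"), list("AABBB")), 3))
```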
An Empirical Exploration of Moral Foundations Theory in Partisan News Sources
Dean Fulgoni, Jordan Carpenter, Lyle Ungar and Daniel Preotiuc-Pietro
News sources frame issues in different ways in order to appeal
to or control the perception of their readers. We present a large
scale study of news articles from partisan sources in the US across
a variety of different issues. We first highlight that differences
between sides exist by predicting the political leaning of articles
of unseen political bias. Framing can be driven by different types
of morality that each group values. We emphasize differences
in the framing of news by building on Moral Foundations
Theory, quantified using hand-crafted lexicons. Our results show
that partisan sources frame political issues differently both in
terms of word usage and through the moral foundations they
relate to.
Building a Dataset for Possessions Identification in Text
Carmen Banea, Xi Chen and Rada Mihalcea
Just as industrialization matured from mass production to
customization and personalization, so has the Web migrated from
generic content to public disclosures of one’s most intimately held
thoughts, opinions and beliefs. This relatively new type of data is
able to represent finer and more narrowly defined demographic
slices. If until now researchers have primarily focused on
leveraging personalized content to identify latent information such
as gender, nationality, location, or age of the author, this study
seeks to establish a structured way of extracting possessions, or
items that people own or are entitled to, as a way to ultimately
provide insights into people’s behaviors and characteristics. In
order to promote more research in this area, we are releasing a set
of 798 possessions extracted from the blog genre, where possessions
are marked at different confidence levels, as well as a detailed set
of guidelines to help in future annotation studies.
The Query of Everything: Developing Open-Domain, Natural-Language Queries for BOLT Information Retrieval
Kira Griffitt and Stephanie Strassel
The DARPA BOLT Information Retrieval evaluations target open-
domain natural-language queries over a large corpus of informal
text in English, Chinese and Egyptian Arabic. We outline the
goals of BOLT IR, comparing it with the prior GALE Distillation
task. After discussing the properties of the BOLT IR corpus,
we provide a detailed description of the query creation process,
contrasting the summary query format presented to systems at run
time with the full query format created by annotators. We describe
the relevance criteria used to assess BOLT system responses,
highlighting the evolution of the procedures used over the three
evaluation phases. We provide a detailed review of the decision
points model for relevance assessment introduced during Phase
2, and conclude with information about inter-assessor consistency
achieved with the decision points assessment model.
The Validation of MRCPD Cross-language Expansions on Imageability Ratings
Ting Liu, Kit Cho, Tomek Strzalkowski, Samira Shaikh and Mehrdad Mirzaei
In this article, we present a method to validate a multi-lingual
(English, Spanish, Russian, and Farsi) corpus on imageability
ratings automatically expanded from MRCPD (Liu et al., 2014).
We employed the concreteness-ratings corpus of Brysbaert et al.
(2014) for our English MRCPD+ validation, because human-assessed
imageability ratings are lacking and concreteness ratings correlate
highly with imageability ratings (e.g., r = .83). For the same
reason, we built a small corpus with human imageability assessments
to validate the corpora for the other languages.
The results show that the automatically expanded imageability
ratings are highly correlated with human assessment in all four
languages, which demonstrates that our automatic expansion method
is valid and robust. We believe these new resources can be
of significant interest to the research community, particularly in
natural language processing and computational sociolinguistics.
Building Tempo-HindiWordNet: A resource for effective temporal information access in Hindi
Dipawesh Pawar, Mohammed Hasanuzzaman and Asif Ekbal
In this paper, we put forward a strategy that supplements Hindi
WordNet entries with information on the temporality of its word
senses. Each synset of Hindi WordNet is automatically annotated
with one of five dimensions: past, present, future, neutral and
atemporal. We use a semi-supervised learning strategy to build
temporal classifiers over the glosses of manually selected initial
seed synsets. The classification process is iterated, repeatedly
expanding the initial seed list based on confidence, until
cross-validation accuracy drops. The resource is unique in nature
as, to the best of our knowledge, no such resource is yet
available for Hindi.
P47 - Semantic Corpora
Friday, May 27, 9:45
Chairperson: Eneko Agirre Poster Session
Detection of Reformulations in Spoken French
Natalia Grabar and Iris Eshkol-Taravela
Our work addresses automatic detection of enunciations and
segments with reformulations in French spoken corpora. The
proposed approach is syntagmatic. It is based on reformulation
markers and specificities of spoken language. The reference data
are built manually and have gone through consensus. Automatic
methods, based on rules and CRF machine learning, are proposed
in order to detect the enunciations and segments that contain
reformulations. With the CRF models, different features are
exploited within a window of various sizes. Detection of
enunciations with reformulations shows up to 0.66 precision.
The tests performed for the detection of reformulated segments
indicate that the task remains difficult. The best average
performance values reach up to 0.65 F-measure, 0.75 precision,
and 0.63 recall. We have several perspectives to this work
for improving the detection of reformulated segments and for
studying the data from other points of view.
DT-Neg: Tutorial Dialogues Annotated for Negation Scope and Focus in Context
Rajendra Banjade and Vasile Rus
Negation occurs more frequently in dialogue than in commonly
written texts, such as literary texts. Furthermore, the scope and
focus of negation depend more on context in dialogues than in
other forms of text. Existing negation datasets have focused on non-
dialogue texts such as literary texts where the scope and focus
of negation is normally present within the same sentence where
the negation is located and therefore are not the most appropriate
to inform the development of negation handling algorithms for
dialogue-based systems. In this paper, we present the DT-Neg corpus
(DeepTutor Negation corpus) which contains texts extracted from
tutorial dialogues where students interacted with an Intelligent
Tutoring System (ITS) to solve conceptual physics problems. The
DT-Neg corpus contains annotated negations in student responses
with scope and focus marked based on the context of the dialogue.
Our dataset contains 1,088 instances and is available for research
purposes at http://language.memphis.edu/dt-neg.
Annotating Logical Forms for EHR Questions
Kirk Roberts and Dina Demner-Fushman
This paper discusses the creation of a semantically annotated
corpus of questions about patient data in electronic health records
(EHRs). The goal is to provide the training data necessary for
semantic parsers to automatically convert EHR questions into a
structured query. A layered annotation strategy is used which
mirrors a typical natural language processing (NLP) pipeline.
First, questions are syntactically analyzed to identify multi-
part questions. Second, medical concepts are recognized and
normalized to a clinical ontology. Finally, logical forms are
created using a lambda calculus representation. We use a corpus
of 446 questions asking for patient-specific information. From
these, 468 specific questions are found containing 259 unique
medical concepts and requiring 53 unique predicates to represent
the logical forms. We further present detailed characteristics of the
corpus, including inter-annotator agreement results, and describe
the challenges automatic NLP systems will face on this task.
A Semantically Compositional Annotation Scheme for Time Normalization
Steven Bethard and Jonathan Parker
We present a new annotation scheme for normalizing time
expressions, such as “three days ago”, to computer-readable forms,
such as 2016-03-07. The annotation scheme addresses several
weaknesses of the existing TimeML standard, allowing the
representation of time expressions that align to more than one
calendar unit (e.g., “the past three summers”), that are defined
relative to events (e.g., “three weeks postoperative”), and that
are unions or intersections of smaller time expressions (e.g.,
“Tuesdays and Thursdays”). It achieves this by modeling time
expression interpretation as the semantic composition of temporal
operators like UNION, NEXT, and AFTER. We have applied
the annotation scheme to 34 documents so far, producing 1104
annotations, and achieving inter-annotator agreement of 0.821.
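As a toy illustration of the compositional idea, anchor-relative operators over dates might look as follows; the function names and day-level granularity here are simplifications assumed for illustration, not the scheme's actual operator inventory (UNION, NEXT, AFTER, ...).

```python
from datetime import date, timedelta

def before(anchor, n_days):
    # "n days ago": shift the anchor date back by n days.
    return anchor - timedelta(days=n_days)

def union(*sets_of_times):
    # UNION of smaller time expressions, e.g. "Tuesdays and Thursdays".
    return set().union(*sets_of_times)

# "three days ago", anchored at 2016-03-10, normalizes to 2016-03-07.
print(before(date(2016, 3, 10), 3).isoformat())
```

The point of composition is that such operators nest: a complex expression like "the past three summers" is built by applying operators to calendar units rather than being listed as an opaque value.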
PROMETHEUS: A Corpus of Proverbs Annotated with Metaphors
Gözde Özbal, Carlo Strapparava and Serra Sinem Tekiroglu
Proverbs are commonly metaphoric in nature, frequently
establishing mappings across domains. The abundance of
metaphors in proverbs makes them an extremely valuable
linguistic resource, since they can be utilized
as a gold standard for various metaphor related linguistic tasks
such as metaphor identification or interpretation. Besides, a
collection of proverbs from various languages annotated with
metaphors would also be essential for social scientists to explore
the cultural differences between those languages. In this paper,
we introduce PROMETHEUS, a dataset consisting of English
proverbs and their equivalents in Italian. In addition to the word-
level metaphor annotations for each proverb, PROMETHEUS
contains other types of information such as the metaphoricity
degree of the overall proverb, its meaning, the century that it was
first recorded in, and a pair of subjective questions answered by
the annotators. To the best of our knowledge, this is the first multi-
lingual and open-domain corpus of proverbs annotated with word-
level metaphors.
Corpus Annotation within the French FrameNet: a Domain-by-domain Methodology
Marianne Djemaa, Marie Candito, Philippe Muller and Laure Vieu
This paper reports on the development of a French FrameNet,
within the ASFALDA project. While the first phase of the
project focused on the development of a French set of frames
and corresponding lexicon (Candito et al., 2014), this paper
concentrates on the subsequent corpus annotation phase, which
focused on four notional domains (commercial transactions,
cognitive stances, causality and verbal communication). Given
that full coverage is not reachable for a relatively “new” FrameNet
project, we advocate focusing on specific notional domains, which
allowed us to obtain full lexical coverage for the frames of
these domains, while partially reflecting word sense ambiguities.
Furthermore, as frames and roles were annotated on two French
Treebanks (the French Treebank (Abeillé and Barrier, 2004) and
the Sequoia Treebank (Candito and Seddah, 2012)), we were
able to extract a syntactico-semantic lexicon from the annotated
frames. In the resource’s current status, there are 98 frames, 662
frame evoking words, 872 senses, and about 13,000 annotated
frames, with their semantic roles assigned to portions of text. The
French FrameNet is freely available at alpage.inria.fr/asfalda.
Covering various Needs in Temporal Annotation: a Proposal of Extension of ISO TimeML that Preserves Upward Compatibility
Anaïs Lefeuvre-Halftermeyer, Jean-Yves Antoine, Alain Couillault, Emmanuel Schang, Lotfi Abouda, Agata Savary, Denis Maurel, Iris Eshkol and Delphine Battistelli
This paper reports a critical analysis of the ISO TimeML standard,
in the light of several experiences of temporal annotation that were
conducted on spoken French. It shows that the norm suffers from
weaknesses that should be corrected to fit a larger variety of needs
in NLP and in corpus linguistics. We present our proposal for
improvements to the norm before its revision by the ISO
Committee in 2017. These modifications concern mainly
(1) Enrichments of well identified features of the norm: temporal
function of TIMEX time expressions, additional types for TLINK
temporal relations; (2) Deeper modifications concerning the units
or features annotated: clarification between time and tense for
EVENT units, coherence of representation between temporal
signals (the SIGNAL unit) and TIMEX modifiers (the MOD
feature); (3) A recommendation to perform temporal annotation
on top of a syntactic (rather than lexical) layer (temporal
annotation on a treebank).
A General Framework for the Annotation of Causality Based on FrameNet
Laure Vieu, Philippe Muller, Marie Candito and Marianne Djemaa
We present here a general set of semantic frames to annotate
causal expressions, with a rich lexicon in French and an annotated
corpus of about 5000 instances of causal lexical items with their
corresponding semantic frames. The aim of our project is to
have both the largest possible coverage of causal phenomena in
French, across all parts of speech, and have it linked to a general
semantic framework such as FN, to benefit in particular from
the relations between other semantic frames, e.g., temporal ones
or intentional ones, and the underlying upper lexical ontology
that enable some forms of reasoning. This is part of the larger
ASFALDA French FrameNet project, which focuses on a few
different notional domains which are interesting in their own
right (Djemaa et al., 2016), including cognitive positions and
communication frames. In the process of building the French
lexicon and preparing the annotation of the corpus, we had to
remodel some of the frames proposed in FN based on English
data, with hopefully more precise frame definitions to facilitate
human annotation. This includes semantic clarifications of frames
and frame elements, redundancy elimination, and added coverage.
The result is arguably a significant improvement of the treatment
of causality in FN itself.
Annotating Temporally-Anchored Spatial Knowledge on Top of OntoNotes Semantic Roles
Alakananda Vempala and Eduardo Blanco
This paper presents a two-step methodology to annotate spatial
knowledge on top of OntoNotes semantic roles. First, we
manipulate semantic roles to automatically generate potential
additional spatial knowledge. Second, we crowdsource
annotations with Amazon Mechanical Turk to either validate
or discard the potential additional spatial knowledge. The
resulting annotations indicate whether entities are or are not
located somewhere with a degree of certainty, and temporally
anchor this spatial information. Crowdsourcing experiments
show that the additional spatial knowledge is ubiquitous and
intuitive to humans, and experimental results show that it can
be inferred automatically using standard supervised machine
learning techniques.
SpaceRef: A corpus of street-level geographic descriptions
Jana Götze and Johan Boye
This article describes SPACEREF, a corpus of street-level
geographic descriptions. Pedestrians are walking a route in a
(real) urban environment, describing their actions. Their position
is automatically logged, their speech is manually transcribed, and
their references to objects are manually annotated with respect
to a crowdsourced geographic database. We describe how the
data was collected and annotated, and how it has been used
in the context of creating resources for an automatic pedestrian
navigation system.
Persian Proposition Bank
Azadeh Mirzaei and Amirsaeid Moloodi
This paper describes the procedure of semantic role labeling
and the development of the first manually annotated Persian
Proposition Bank (PerPB) which added a layer of predicate-
argument information to the syntactic structures of Persian
Dependency Treebank (known as PerDT). Through the process
of annotating, the annotators could see the syntactic information
of all the sentences, and so they annotated 29,982 sentences with
more than 9,200 unique verbs. In the annotation procedure, the
direct syntactic dependents of the verbs were the first candidates
for being annotated. So we did not annotate the other indirect
dependents unless their phrasal heads were propositional and had
their own arguments or adjuncts. Hence besides the semantic
role labeling of verbs, the argument structures of 1,300 unique
propositional nouns and 300 unique propositional adjectives were
annotated in the sentences, too. The accuracy of the annotation
process was measured by double annotation of the data at two
separate stages and finally the data was prepared in the CoNLL
dependency format.
Typed Entity and Relation Annotation on Computer Science Papers
Yuka Tateisi, Tomoko Ohta, Sampo Pyysalo, Yusuke Miyao and Akiko Aizawa
We describe our ongoing effort to establish an annotation scheme
for describing the semantic structures of research articles in the
computer science domain, with the intended use of developing
search systems that can refine their results by the roles of the
entities denoted by the query keys. In our scheme, mentions of
entities are annotated with ontology-based types, and the roles of
the entities are annotated as relations with other entities described
in the text. So far, we have annotated 400 abstracts from the ACL
anthology and the ACM digital library. In this paper, the scheme
and the annotated dataset are described, along with the problems
found in the course of annotation. We also show the results of
automatic annotation and evaluate the corpus in a practical setting
in application to topic extraction.
Enriching TimeBank: Towards a more precise annotation of temporal relations in a text
Volker Gast, Lennart Bierkandt, Stephan Druskat and Christoph Rzymski
We propose a way of enriching the TimeML annotations of
TimeBank by adding information about the Topic Time in
terms of Klein (1994). The annotations are partly automatic,
partly inferential and partly manual. The corpus was converted
into the native format of the annotation software GraphAnno
and POS-tagged using the Stanford bidirectional dependency
network tagger. On top of each finite verb, a FIN-node
with tense information was created, and on top of any FIN-
node, a TOPICTIME-node, in accordance with Klein’s (1994)
treatment of finiteness as the linguistic correlate of the Topic
Time. Each TOPICTIME-node is linked to a MAKEINSTANCE-
node representing an (instantiated) event in TimeML (Pustejovsky
et al. 2005), the markup language used for the annotation
of TimeBank. For such links we introduce a new category,
ELINK. ELINKs capture the relationship between the Topic Time
(TT) and the Time of Situation (TSit) and have an aspectual
interpretation in Klein’s (1994) theory. In addition to these
automatic and inferential annotations, some TLINKs were added
manually. Using an example from the corpus, we show that the
inclusion of the Topic Time in the annotations allows for a richer
representation of the temporal structure than does TimeML. A
way of representing this structure in a diagrammatic form similar
to the T-Box format (Verhagen, 2007) is proposed.
P48 - Speech Processing (2)
Friday, May 27, 9:45
Chairperson: Denise DiPersio Poster Session
How Diachronic Text Corpora Affect Context-based Retrieval of OOV Proper Names for Audio News
Imran Sheikh, Irina Illina and Dominique Fohr
Out-Of-Vocabulary (OOV) words missed by Large Vocabulary
Continuous Speech Recognition (LVCSR) systems can be
recovered with the help of topic and semantic context of the OOV
words captured from a diachronic text corpus. In this paper we
investigate how the choice of documents for the diachronic text
corpora affects the retrieval of OOV Proper Names (PNs) relevant
to an audio document. We first present our diachronic French
broadcast news datasets, which highlight the motivation of our
study on OOV PNs. Then the effect of using diachronic text data
from different sources and a different time span is analysed. With
OOV PN retrieval experiments on French broadcast news videos,
we conclude that a diachronic corpus with text from different
sources leads to better retrieval performance than one relying on
text from a single source or from a longer time span.
Syllable based DNN-HMM Cantonese Speech to Text System
Timothy Wong, Claire Li, Sam Lam, Billy Chiu, Qin Lu, Minglei Li, Dan Xiong, Roy Shing Yu and Vincent T.Y. Ng
This paper reports our work on building up a Cantonese Speech-
to-Text (STT) system with a syllable based acoustic model. This
is a part of an effort in building a STT system to aid dyslexic
students who have cognitive deficiency in writing skills but have
no problem expressing their ideas through speech. For Cantonese
speech recognition, the basic unit of acoustic models can either be
the conventional Initial-Final (IF) syllables, or the Onset-Nucleus-
Coda (ONC) syllables where finals are further split into nucleus
and coda to reflect the intra-syllable variations in Cantonese. By
using the Kaldi toolkit, our system is trained using the stochastic
gradient descent optimization model with the aid of GPUs for the
hybrid Deep Neural Network and Hidden Markov Model (DNN-
HMM) with and without I-vector based speaker adaptive training
technique. The input features of the same Gaussian Mixture
Model with speaker adaptive training (GMM-SAT) to DNN are
used in all cases. Experiments show that the ONC-based syllable
acoustic modeling with I-vector based DNN-HMM achieves the
best performance with the word error rate (WER) of 9.66% and
the real time factor (RTF) of 1.38812.
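The word error rate reported above is the word-level edit distance between hypothesis and reference, divided by the reference length; a minimal generic sketch (not the Kaldi scoring code):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edits to turn ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One deletion out of six reference words: WER ≈ 0.167.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```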
Collecting Resources in Sub-Saharan African Languages for Automatic Speech Recognition: a Case Study of Wolof
Elodie Gauthier, Laurent Besacier, Sylvie Voisin, Michael Melese and Uriel Pascal Elingui
This article presents the data collected and the ASR systems
developed for four sub-Saharan African languages (Swahili, Hausa,
Amharic and Wolof). To illustrate our methodology, the focus is
made on Wolof (a very under-resourced language) for which we
designed the first ASR system ever built in this language. All data
and scripts are available online on our github repository.
SCALE: A Scalable Language Engineering Toolkit
Joris Pelemans, Lyan Verwimp, Kris Demuynck, Hugo Van hamme and Patrick Wambacq
In this paper we present SCALE, a new Python toolkit that
contains two extensions to n-gram language models. The first
extension is a novel technique to model compound words called
Semantic Head Mapping (SHM). The second extension, Bag-of-
Words Language Modeling (BagLM), bundles popular models
such as Latent Semantic Analysis and Continuous Skip-grams.
Both extensions scale to large data and allow the integration into
first-pass ASR decoding. The toolkit is open source, includes
working examples and can be found at http://github.com/jorispelemans/scale.
Combining Manual and Automatic Prosodic Annotation for Expressive Speech Synthesis
Sandrine Brognaux, Thomas Francois and Marco Saerens
Text-to-speech has long been centered on the production of an
intelligible message of good quality. More recently, interest has
shifted to the generation of more natural and expressive speech.
A major issue of existing approaches is that they usually rely on
a manual annotation in expressive styles, which tends to be rather
subjective. A typical related issue is that the annotation is strongly
influenced – and possibly biased – by the semantic content of the
text (e.g. a shot or a fault may incite the annotator to tag that
sequence as expressing a high degree of excitation, independently
of its acoustic realization). This paper investigates the assumption
that human annotation of basketball commentaries in excitation
levels can be automatically improved on the basis of acoustic
features. It presents two techniques for label correction exploiting
a Gaussian mixture and a proportional-odds logistic regression.
The automatically re-annotated corpus is then used to train HMM-
based expressive speech synthesizers, the performance of which
is assessed through subjective evaluations. The results indicate
that the automatic correction of the annotation with Gaussian
mixture helps to synthesize more contrasted excitation levels,
while preserving naturalness.
BAS Speech Science Web Services - an Update of Current Developments
Thomas Kisler, Uwe Reichel, Florian Schiel, Christoph Draxler, Bernhard Jackl and Nina Pörner
In 2012 the Bavarian Archive for Speech Signals started providing
some of its spoken language processing tools in the form of
Software as a Service (SaaS). This means users access
the processing functionality over a web browser and therefore
do not have to install complex software packages on a local
computer. Amongst others, these tools include segmentation
& labeling, grapheme-to-phoneme conversion, text alignment,
syllabification and metadata generation, where all but the last
are available for a variety of languages. Since its creation the
number of available services and the web interface have changed
considerably. We give an overview and a detailed description
of the system architecture, the available web services and their
functionality. Furthermore, we show how the number of files
processed over the system developed in the last four years.
SPA: Web-based Platform for Easy Access to Speech Processing Modules
Fernando Batista, Pedro Curto, Isabel Trancoso, Alberto Abad, Jaime Ferreira, Eugénio Ribeiro, Helena Moniz, David Martins de Matos and Ricardo Ribeiro
This paper presents SPA, a web-based Speech Analytics platform
that integrates several speech processing modules and that makes
it possible to use them through the web. It was developed
with the aim of facilitating the usage of the modules, without
the need to know about software dependencies and specific
configurations. Apart from being accessed by a web-browser,
the platform also provides a REST API for easy integration
with other applications. The platform is flexible, scalable,
provides authentication for access restrictions, and was developed
taking into consideration the time and effort of providing new
services. The platform is still being improved, but it already
integrates a considerable number of audio and text processing
modules, including: automatic transcription, speech disfluency
classification, emotion detection, dialog act recognition, age and
gender classification, non-nativeness detection, hyper-articulation
detection, and two external modules for feature extraction and
DTMF detection. This paper describes
the SPA architecture, presents the already integrated modules,
and provides a detailed description for the ones most recently
integrated.
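The abstract mentions a REST API but gives no route names, so the endpoint, parameter names, and authentication scheme below are purely hypothetical; this sketch only shows the general shape of an authenticated JSON request to such a platform, without sending it:

```python
import json
import urllib.request

# Hypothetical base URL and route; the real SPA API is not documented in the abstract.
BASE_URL = "https://spa.example.org/api"

def build_module_request(audio_id, module="transcription", token="SECRET"):
    """Build (but do not send) an authenticated JSON request for one processing module."""
    payload = json.dumps({"audio_id": audio_id, "module": module}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/{module}",
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )

req = build_module_request("sample-001")
print(req.full_url, req.method)
```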
Enhanced CORILGA: Introducing the Automatic Phonetic Alignment Tool for Continuous Speech
Roberto Seara, Marta Martinez, Rocio Varela, Carmen García Mateo, Elisa Fernandez Rei and Xose Luis Regueira
The “Corpus Oral Informatizado da Lingua Galega (CORILGA)”
project aims at building a corpus of oral language for Galician,
primarily designed to study the linguistic variation and change.
This project is currently under development and it is periodically
enriched with new contributions. The long-term goal is that
all the speech recordings will be enriched with phonetic,
syllabic, morphosyntactic, lexical and sentence ELAN-compliant
annotations. A way to speed up the process of annotation is
to use automatic speech-recognition-based tools tailored to the
application. Therefore, the CORILGA repository has been enhanced
with an automatic alignment tool, available to the administrator
of the repository, that aligns speech with an orthographic
transcription. In the event that no transcription, or just a partial
one, is available, a speech recognizer for Galician is used to
generate word and phonetic segmentations. These recognized
outputs may contain errors that will have to be manually corrected
by the administrator. For assisting this task, the tool also provides
an ELAN tier with the confidence measure of each recognized
word. In this paper, after the description of the main facts of the
CORILGA corpus, the speech alignment and recognition tools are
described. Both have been developed using the Kaldi toolkit.
O41 - Discourse
Friday, May 27, 11:45 - Oral Session
Chairperson: Justus Roux
A Corpus of Argument Networks: Using Graph Properties to Analyse Divisive Issues
Barbara Konat, John Lawrence, Joonsuk Park, Katarzyna Budzynska and Chris Reed
Governments are increasingly utilising online platforms in order
to engage with, and ascertain the opinions of, their citizens. Whilst
policy makers could potentially benefit from such enormous
feedback from society, they first face the challenge of making
sense out of the large volumes of data produced. This creates a
demand for tools and technologies which will enable governments
to quickly and thoroughly digest the points being made and
to respond accordingly. By determining the argumentative and
dialogical structures contained within a debate, we are able
to determine the issues which are divisive and those which
attract agreement. This paper proposes a method of graph-based
analytics which uses properties of graphs representing networks
of arguments pro- & con- in order to automatically analyse issues
which divide citizens about new regulations. By future application
of the most recent advances in argument mining, the results
reported here will have a chance to scale up to enable sense-
making of the vast amount of feedback received from citizens on
directions that policy should take.
metaTED: a Corpus of Metadiscourse for Spoken Language
Rui Correia, Nuno Mamede, Jorge Baptista and Maxine Eskenazi
This paper describes metaTED – a freely available corpus
of metadiscursive acts in spoken language collected via
crowdsourcing. Metadiscursive acts were annotated on a set
of 180 randomly chosen TED talks in English, spanning
different speakers and topics. The taxonomy used for annotation
is composed of 16 categories, adapted from Adel (2010). This
adaptation takes into account both the material to annotate
and the setting in which the annotation task is performed.
The crowdsourcing setup is described, including considerations
regarding training and quality control. The collected data is
evaluated in terms of quantity of occurrences, inter-annotator
agreement, and annotation related measures (such as average time
on task and self-reported confidence). Results show different
levels of agreement among metadiscourse acts (α ∈ [0.15; 0.49]).
To further assess the collected material, a subset of the annotations
was submitted to experts, who validated which of
the marked occurrences truly correspond to instances of the
metadiscursive act at hand. Similarly to what happened with the
crowd, experts revealed different levels of agreement between
categories (α ∈ [0.18; 0.72]). The paper concludes with a
discussion on the applicability of metaTED with respect to each
of the 16 categories of metadiscourse.
PARC 3.0: A Corpus of Attribution Relations
Silvia Pareti
Quotation and opinion extraction, discourse and factuality have all
partly addressed the annotation and identification of Attribution
Relations. However, disjoint efforts have provided a partial and
partly inaccurate picture of attribution and generated small or
incomplete resources, thus limiting the applicability of machine
learning approaches. This paper presents PARC 3.0, a large
corpus fully annotated with Attribution Relations (ARs). The
annotation scheme was tested with an inter-annotator agreement
study showing satisfactory results for the identification of ARs and
high agreement on the selection of the text spans corresponding to
its constitutive elements: source, cue and content. The corpus,
which comprises around 20k ARs, was used to investigate the
range of structures that can express attribution. The results show a
complex and varied relation of which the literature has addressed
only a portion. PARC 3.0 is available for research use and can
be used in a range of different studies to analyse attribution and
validate assumptions as well as to develop supervised attribution
extraction models.
Improving the Annotation of Sentence Specificity
Junyi Jessy Li, Bridget O’Daniel, Yi Wu, Wenli Zhao and Ani Nenkova
We introduce improved guidelines for annotation of sentence
specificity, addressing the issues encountered in prior work. Our
annotation provides judgements of sentences in context. Rather
than binary judgements, we introduce a specificity scale which
accommodates nuanced judgements. Our augmented annotation
procedure also allows us to define where in the discourse context
the lack of specificity can be resolved. In addition, the cause of the
underspecification is annotated in the form of free text questions.
We present results from a pilot annotation with this new scheme
and demonstrate good inter-annotator agreement. We found that
the lack of specificity distributes evenly among immediate prior
context, long distance prior context and no prior context. We find
that missing details that are not resolved in the prior context
are more likely to trigger questions about the reason behind events,
“why” and “how”. Our data is accessible at http://www.cis.upenn.edu/~nlp/corpora/lrec16spec.html
Focus Annotation of Task-based Data: A Comparison of Expert and Crowd-Sourced Annotation in a Reading Comprehension Corpus
Kordula De Kuthy, Ramon Ziai and Detmar Meurers
While the formal pragmatic concepts in information structure,
such as the focus of an utterance, are precisely defined in
theoretical linguistics and potentially very useful in conceptual
and practical terms, it has turned out to be difficult to reliably
annotate such notions in corpus data. We present a large-
scale focus annotation effort designed to overcome this problem.
Our annotation study is based on the tasked-based corpus
CREG, which consists of answers to explicitly given reading
comprehension questions. We compare focus annotation by
trained annotators with a crowd-sourcing setup making use of
untrained native speakers. Given the task context and an
annotation process incrementally making the question form and
answer type explicit, the trained annotators reach substantial
agreement for focus annotation. Interestingly, the crowd-sourcing
setup also supports high-quality annotation – for specific subtypes
of data. Finally, we turn to the question whether the relevance
of focus annotation can be extrinsically evaluated. We show
that automatic short-answer assessment significantly improves for
focus annotated data. The focus annotated CREG corpus is freely
available and constitutes the largest such resource for German.
O42 - Twitter Related Analysis
Friday, May 27, 11:45 - Oral Session
Chairperson: Xavier Tannier
Homing in on Twitter Users: Evaluating an Enhanced Geoparser for User Profile Locations
Beatrice Alex, Clare Llewellyn, Claire Grover, Jon Oberlander and Richard Tobin
Twitter-related studies often need to geo-locate Tweets or Twitter
users, identifying their real-world geographic locations. As tweet-
level geotagging remains rare, most prior work exploited tweet
content, timezone and network information to inform geolocation,
or else relied on off-the-shelf tools to geolocate users from
location information in their user profiles. However, such user
location metadata is not consistently structured, causing such
tools to fail regularly, especially if a string contains multiple
locations, or if locations are very fine-grained. We argue that
user profile location (UPL) and tweet location need to be treated
as distinct types of information from which differing inferences
can be drawn. Here, we apply geoparsing to UPLs, and
demonstrate how task performance can be improved by adapting
our Edinburgh Geoparser, which was originally developed for
processing English text. We present a detailed evaluation method
and results, including inter-coder agreement. We demonstrate
that the optimised geoparser can effectively extract and geo-
reference multiple locations at different levels of granularity with
an F1-score of around 0.90. We also illustrate how geoparsed
UPLs can be exploited for international information trade studies
and country-level sentiment analysis.
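The F1-score of around 0.90 reported above is the harmonic mean of precision and recall. As a quick reference (with made-up precision/recall values, not figures from the paper):

```python
def f1(precision, recall):
    """F1-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative values only; the paper reports F1 around 0.90 for geo-referencing.
print(round(f1(0.92, 0.88), 4))  # → 0.8996
```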
A Dataset for Detecting Stance in Tweets
Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu and Colin Cherry
We can often detect from a person’s utterances whether he/she
is in favor of or against a given target entity (a product, topic,
another person, etc.). Here for the first time we present a dataset
of tweets annotated for whether the tweeter is in favor of or against
pre-chosen targets of interest – their stance. The targets of interest
may or may not be referred to in the tweets, and they may or
may not be the target of opinion in the tweets. The data pertains
to six targets of interest commonly known and debated in the
United States. Apart from stance, the tweets are also annotated for
whether the target of interest is the target of opinion in the tweet.
The annotations were performed by crowdsourcing. Several
techniques were employed to encourage high-quality annotations
(for example, providing clear and simple instructions) and to
identify and discard poor annotations (for example, using a small
set of check questions annotated by the authors). This Stance
Dataset, which was subsequently also annotated for sentiment,
can be used to better understand the relationship between stance,
sentiment, entity relationships, and textual inference.
Emotion Analysis on Twitter: The Hidden Challenge
Luca Dini and André Bittar
In this paper, we present an experiment to detect emotions in
tweets. Unlike much previous research, we draw the important
distinction between the tasks of emotion detection in a closed
world assumption (i.e. every tweet is emotional) and the
complicated task of identifying emotional versus non-emotional
tweets. Given an apparent lack of appropriately annotated data, we
created two corpora for these tasks. We describe two systems, one
symbolic and one based on machine learning, which we evaluated
on our datasets. Our evaluation shows that a machine learning
classifier performs best on emotion detection, while a symbolic
approach is better for identifying relevant (i.e. emotional) tweets.
Crowdsourcing Salient Information from News and Tweets
Oana Inel, Tommaso Caselli and Lora Aroyo
The increasing streams of information pose challenges to both
humans and machines. On the one hand, humans need to identify
relevant information and consume only the information that lies
within their interests. On the other hand, machines need to understand
the information that is published in online data streams and
generate concise and meaningful overviews. We consider events
as prime factors to query for information and generate meaningful
context. The focus of this paper is to acquire empirical insights
for identifying salience features in tweets and news about a target
event, i.e., the event of “whaling”. We first derive a methodology
to identify such features by building up a knowledge space of the
event enriched with relevant phrases, sentiments and ranked by
their novelty. We applied this methodology on tweets and we
have performed preliminary work towards adapting it to news
articles. Our results show that crowdsourcing text relevance,
sentiments and novelty (1) can be a main step in identifying
salient information, and (2) provides a deeper and more precise
understanding of the data at hand compared to state-of-the-art
approaches.
What does this Emoji Mean? A Vector Space Skip-Gram Model for Twitter Emojis
Francesco Barbieri, Francesco Ronzano and Horacio Saggion
Emojis allow us to describe objects, situations and even feelings
with small images, providing a visual and quick way to
communicate. In this paper, we analyse emojis used in Twitter
with distributional semantic models. We retrieve 10 million
tweets posted by USA users, and we build several skip-gram
word embedding models by mapping into the same vectorial space
both words and emojis. We test our models with semantic
similarity experiments, comparing the output of our models with
human assessment. We also carry out an exhaustive qualitative
evaluation, showing interesting results.
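The semantic similarity experiments mentioned above rest on the fact that, once words and emojis live in one vector space, their similarity can be scored with cosine similarity. A toy sketch (the three-dimensional vectors below are invented for illustration; the paper's models are skip-gram embeddings trained on the tweet corpus):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy embeddings; words and emojis share one vector space.
vectors = {
    "happy": [0.9, 0.1, 0.2],
    "😀":    [0.85, 0.15, 0.25],
    "rain":  [0.1, 0.9, 0.3],
}
print(cosine(vectors["happy"], vectors["😀"]))    # high similarity
print(cosine(vectors["happy"], vectors["rain"]))  # low similarity
```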
O43 - Semantics
Friday, May 27, 11:45 - Oral Session
Chairperson: James Pustejovsky
Crossmodal Network-Based Distributional Semantic Models
Elias Iosif and Alexandros Potamianos
Despite the recent success of distributional semantic models
(DSMs) in various semantic tasks, they remain disconnected from
real-world perceptual cues since they typically rely on linguistic
features. Text data constitute the dominant source of features
for the majority of such models, although there is evidence from
cognitive science that cues from other modalities contribute to
the acquisition and representation of semantic knowledge. In
this work, we propose the crossmodal extension of a two-tier
text-based model, where semantic representations are encoded
in the first layer, while the second layer is used for computing
similarity between words. We exploit text- and image-derived
features for performing computations at each layer, as well as
various approaches for their crossmodal fusion. It is shown
that the crossmodal model performs better (from 0.68 to 0.71
correlation coefficient) than the unimodal one for the task of
similarity computation between words.
Comprehensive and Consistent PropBank Light Verb Annotation
Claire Bonial and Martha Palmer
Recent efforts have focused on expanding the annotation coverage
of PropBank from verb relations to adjective and noun relations, as
well as light verb constructions (e.g., make an offer, take a bath).
While each new relation type has presented unique annotation
challenges, ensuring consistent and comprehensive annotation
of light verb constructions has proved particularly challenging,
given that light verb constructions are semi-productive, difficult
to define, and there are often borderline cases. This research
describes the iterative process of developing PropBank annotation
guidelines for light verb constructions, the current guidelines, and
a comparison to related resources.
Inconsistency Detection in Semantic Annotation
Nora Hollenstein, Nathan Schneider and Bonnie Webber
Inconsistencies are part of any manually annotated corpus.
Automatically finding these inconsistencies and correcting them
(even manually) can increase the quality of the data. Past research
has focused mainly on detecting inconsistency in syntactic
annotation. This work explores new approaches to detecting
inconsistency in semantic annotation. Two ranking methods
are presented in this paper: a discrepancy ranking and an
entropy ranking. Those methods are then tested and evaluated
on multiple corpora annotated with multiword expressions and
supersense labels. The results show considerable improvements
in detecting inconsistency candidates over a random baseline.
Possible applications of methods for inconsistency detection are
improving the annotation procedure as well as the guidelines and
correcting errors in completed annotations.
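The entropy ranking described in this abstract can be illustrated with a minimal sketch: for each recurring expression, compute the Shannon entropy of its label distribution and rank expressions by it, so that those with the most annotator disagreement surface first. The expressions and labels below are invented examples, not the paper's data:

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy of the label distribution for one recurring expression;
    higher entropy means more disagreement across its annotated occurrences."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Invented annotation sets for two multiword expressions:
annotations = {
    "kick the bucket": ["MWE", "MWE", "MWE", "literal"],
    "make a decision": ["MWE", "MWE", "MWE", "MWE"],
}
ranked = sorted(annotations, key=lambda e: label_entropy(annotations[e]), reverse=True)
print(ranked[0])  # the expression with the most inconsistent labels
```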
Towards Comparability of Linguistic Graph Banks for Semantic Parsing
Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinkova, Dan Flickinger, Jan Hajic, Angelina Ivanova and Zdenka Uresova
We announce a new language resource for research on semantic
parsing, a large, carefully curated collection of semantic
dependency graphs representing multiple linguistic traditions.
This resource is called SDP 2016 and provides an update and
extension to previous versions used as Semantic Dependency
Parsing target representations in the 2014 and 2015 Semantic
Evaluation Exercises. For a common core of English text, this
third edition comprises semantic dependency graphs from four
distinct frameworks, packaged in a unified abstract format and
aligned at the sentence and token levels. SDP 2016 is the
first general release of this resource and is available for licensing
from the Linguistic Data Consortium in May 2016. The data is
accompanied by an open-source SDP utility toolkit and system
results from previous contrastive parsing evaluations against these
target representations.
Event Coreference Resolution with Multi-Pass Sieves
Jing Lu and Vincent Ng
Multi-pass sieve approaches have been successfully applied to
entity coreference resolution and many other tasks in natural
language processing (NLP), owing in part to the ease of designing
high-precision rules for these tasks. However, the same is not
true for event coreference resolution: typically lying towards
the end of the standard information extraction pipeline, an
event coreference resolver assumes as input the noisy outputs
of its upstream components such as the trigger identification
component and the entity coreference resolution component. The
difficulty in designing high-precision rules makes it challenging
to successfully apply a multi-pass sieve approach to event
coreference resolution. In this paper, we investigate this challenge,
proposing the first multi-pass sieve approach to event coreference
resolution. When evaluated on the version of the KBP 2015
corpus available to the participants of EN Task 2 (Event Nugget
Detection and Coreference), our approach achieves an Avg F-
score of 40.32%, outperforming the best participating system by
0.67% in Avg F-score.
O44 - Speech Resources
Friday, May 27, 11:45 - Oral Session
Chairperson: Sophie Rosset
Endangered Language Documentation: Bootstrapping a Chatino Speech Corpus, Forced Aligner, ASR
Malgorzata Cavar, Damir Cavar and Hilaria Cruz
This project approaches the problem of language documentation
and revitalization from a rather untraditional angle. To improve
and facilitate language documentation of endangered languages,
we attempt to use corpus linguistic methods and speech and
language technologies to reduce the time needed for transcription
and annotation of audio and video language recordings. The paper
demonstrates this approach on the example of the endangered and
seriously under-resourced variety of Eastern Chatino (CTP). We
show how initial speech corpora can be created that can facilitate
the development of speech and language technologies for under-
resourced languages by utilizing Forced Alignment tools to time
align transcriptions. Time-aligned transcriptions can be used to
train speech recognition models and utilize automatic speech recognition tools
for the transcription and annotation of untranscribed data. Speech
technologies can be used to reduce the time and effort necessary
for transcription and annotation of large collections of audio
and video recordings in digital language archives, addressing the
transcription bottleneck problem that most language archives and
many under-documented languages are confronted with. This
approach can increase the availability of language resources from
low-resourced and endangered languages to speech and language
technology research and development.
The DIRHA Portuguese Corpus: A Comparison of Home Automation Command Detection and Recognition in Simulated and Real Data
Miguel Matos, Alberto Abad and António Serralheiro
In this paper, we describe a new corpus, named DIRHA-
L2F RealCorpus, composed of typical home automation speech
interactions in European Portuguese that has been recorded by
the INESC-ID’s Spoken Language Systems Laboratory (L2F) to
support the activities of the Distant-speech Interaction for Robust
Home Applications (DIRHA) EU-funded project. The corpus is
a multi-microphone and multi-room database of real continuous
audio sequences containing read phonetically rich sentences, read
and spontaneous keyword activation sentences, and read and
spontaneous home automation commands. The background noise
conditions are controlled and randomly recreated with noises
typically found in home environments. Experimental validation
on this corpus is reported in comparison with the results obtained
on a simulated corpus using a fully automated speech processing
pipeline for two fundamental automatic speech recognition tasks
of typical ’always-listening’ home-automation scenarios: system
activation and voice command recognition. Considering the results
on both corpora, the presence of overlapping voice-like noise
is shown to be the main problem: simulated sequences contain
concurrent speakers, which in general results in a more challenging
corpus, while performance on real sequences drops drastically when
the TV or radio is on.
Accuracy of Automatic Cross-Corpus Emotion Labeling for Conversational Speech Corpus Commonization
Hiroki Mori, Atsushi Nagaoka and Yoshiko Arimoto
There exists a major incompatibility in emotion labeling
frameworks among emotional speech corpora: category-based
versus dimension-based. Commonizing these requires inter-corpus
emotion labeling according to both frameworks, but
doing this by human annotators is too costly for most cases.
This paper examines the possibility of automatic cross-corpus
emotion labeling. In order to evaluate the effectiveness of
the automatic labeling, a comprehensive emotion annotation for
two conversational corpora, UUDB and OGVC, was performed.
With a state-of-the-art machine learning technique, dimensional
and categorical emotion estimation models were trained and
tested against the two corpora. For the emotion dimension
estimation, the automatic cross-corpus emotion labeling for the
different corpus was effective for the dimensions of aroused-
sleepy, dominant-submissive and interested-indifferent, showing
only slight performance degradation against the result for the same
corpus. On the other hand, the performance for the emotion
category estimation was not sufficient.
English-to-Japanese Translation vs. Dictation vs. Post-editing: Comparing Translation Modes in a Multilingual Setting
Michael Carl, Akiko Aizawa and Masaru Yamada
Speech-enabled interfaces have the potential to become one of the
most efficient and ergonomic environments for human-computer
interaction and for text production. However, not much research
has been carried out to investigate in detail the processes and
strategies involved in the different modes of text production.
This paper introduces and evaluates a corpus of more than 55
hours of English-to-Japanese user activity data that were collected
within the ENJA15 project, in which translators were observed
while writing and speaking translations (translation dictation) and
during machine translation post-editing. The transcription of
the spoken data, keyboard logging and eye-tracking data were
recorded with Translog-II, post-processed and integrated into the
CRITT Translation Process Research-DB (TPR-DB), which is
publicly available under a creative commons license. The paper
presents the ENJA15 data as part of a large multilingual Chinese,
Danish, German, Hindi and Spanish translation process data
collection of more than 760 translation sessions. It compares the
ENJA15 data with the other language pairs and reviews some of
its particularities.
Database of Mandarin Neighborhood Statistics
Karl Neergaard, Hongzhi Xu and Chu-Ren Huang
In the design of controlled experiments with language stimuli,
researchers from psycholinguistics, neurolinguistics, and related
fields require language resources that isolate variables known
to affect language processing. This article describes a
freely available database that provides word-level statistics for
words and nonwords of Mandarin Chinese. The featured
lexical statistics include subtitle corpus frequency, phonological
neighborhood density, neighborhood frequency, and homophone
density. The accompanying word descriptors include pinyin,
ASCII phonetic transcription (SAMPA), lexical tone, syllable
structure, dominant PoS, and syllable, segment and pinyin
lengths for each phonological word. It is designed for
researchers particularly concerned with language processing of
isolated words and made to accommodate multiple existing
hypotheses concerning the structure of the Mandarin syllable.
The database is divided into multiple files according to the
desired search criteria: 1) the syllable segmentation schema
used to calculate density measures, and 2) whether the search is
for words or nonwords. The database is open to the research
community at https://github.com/karlneergaard/Mandarin-Neighborhood-Statistics.
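Phonological neighborhood density, one of the database's featured statistics, is commonly operationalized as the number of lexicon entries one edit (substitution, insertion, or deletion of a segment) away from the target. The toy sketch below treats each character of a pinyin-plus-tone string as a segment, which is only one of the segmentation schemata the database supports; the lexicon is invented:

```python
def edit_distance_one(a, b):
    """True if strings a and b differ by exactly one substitution,
    insertion, or deletion (a common definition of a phonological neighbor)."""
    if a == b:
        return False
    la, lb = len(a), len(b)
    if abs(la - lb) > 1:
        return False
    if la == lb:
        return sum(x != y for x, y in zip(a, b)) == 1
    if la > lb:
        a, b = b, a  # ensure a is the shorter string
    # b is a neighbor if deleting one segment from b yields a
    return any(b[:i] + b[i + 1:] == a for i in range(len(b)))

def neighborhood_density(word, lexicon):
    """Count of one-edit neighbors of word in the lexicon (word itself excluded)."""
    return sum(edit_distance_one(word, w) for w in lexicon)

# Invented toy lexicon of syllable+tone strings:
lexicon = ["ma1", "ma2", "man1", "mo1", "ta1"]
print(neighborhood_density("ma1", lexicon))  # → 4 (ma2, man1, mo1, ta1)
```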
P49 - Corpus Creation and Querying (2)
Friday, May 27, 11:45 - Poster Session
Chairperson: Menzo Windhouwer
TEITOK: Text-Faithful Annotated Corpora
Maarten Janssen
TEITOK is a web-based framework for corpus creation,
annotation, and distribution that combines textual and linguistic
annotation within a single TEI-based XML document. TEITOK
provides several built-in NLP tools to automatically (pre)process
texts, and is highly customizable. It features multiple orthographic
transcription layers, and a wide range of user-defined token-based
annotations. For searching, TEITOK interfaces with a local CQP
server. TEITOK can handle various types of additional resources
including Facsimile images and linked audio files, making it
possible to have a combined written/spoken corpus. It also has
additional modules for PSDX syntactic annotation and several
types of stand-off annotation.
Extracting Interlinear Glossed Text from LaTeX Documents
Mathias Schenner and Sebastian Nordhoff
We present texigt, a command-line tool for the extraction of
structured linguistic data from LaTeX source documents, and a
language resource that has been generated using this tool: a corpus
of interlinear glossed text (IGT) extracted from open access books
published by Language Science Press. Extracted examples are
represented in a simple XML format that is easy to process and
can be used to validate certain aspects of interlinear glossed text.
The main challenge involved is the parsing of TeX and LaTeX
documents. We review why this task is impossible in general
and how the texhs Haskell library uses a layered architecture and
selective early evaluation (expansion) during lexing and parsing
in order to provide access to structured representations of LaTeX
documents at several levels. In particular, its parsing modules
generate an abstract syntax tree for LaTeX documents after
expansion of all user-defined macros and lexer-level commands
that serves as an ideal interface for the extraction of interlinear
glossed text by texigt. This architecture can easily be adapted to
extract other types of linguistic data structures from LaTeX source
documents.
Interoperability of Annotation Schemes: Using the Pepper Framework to Display AWA Documents in the ANNIS Interface
Talvany Carlotto, Zuhaitz Beloki, Xabier Artola and Aitor Soroa
Natural language processing applications are frequently integrated
to solve complex linguistic problems, but the lack of
interoperability between these tools tends to be one of the
main issues found in that process. That is often caused by
the different linguistic formats used across the applications,
which leads to attempts to both establish standard formats to
represent linguistic information and to create conversion tools
to facilitate this integration. Pepper is an example of the latter,
as a framework that helps the conversion between different
linguistic annotation formats. In this paper, we describe the
use of Pepper to convert a corpus linguistically annotated by
the annotation scheme AWA into the relANNIS format, with the
ultimate goal of interacting with AWA documents through the
ANNIS interface. The experiment converted 40 megabytes of
AWA documents, allowed their use on the ANNIS interface, and
involved making architectural decisions during the mapping from
AWA into relANNIS using Pepper. The main difficulties faced
during this process were technical, caused mainly by the
integration of the different systems and projects involved, namely
AWA, Pepper and ANNIS.
SPLIT: Smart Preprocessing (Quasi) Language Independent Tool
Mohamed Al-Badrashiny, Arfath Pasha, Mona Diab, Nizar Habash, Owen Rambow, Wael Salloum and Ramy Eskander
Text preprocessing is an important and necessary task for all NLP
applications. A simple variation in any preprocessing step may
drastically affect the final results. Moreover, replicability and
comparability are, as much as feasible, goals of our scientific
enterprise; thus, building systems that ensure consistency across
our various pipelines would contribute significantly to these
goals. The problem has become quite pronounced with the growing
abundance of NLP tools, which come with widely varying levels
of specification. In this paper, we present
a dynamic unified preprocessing framework and tool, SPLIT,
that is highly configurable based on user requirements and
serves as a single preprocessing front-end for several tools at once. SPLIT
aims to standardize the implementations of the most important
preprocessing steps by allowing for a unified API that could
be exchanged across different researchers to ensure complete
transparency in replication. The user is able to select the required
preprocessing tasks among a long list of preprocessing steps. The
user is also able to specify the order of execution which in turn
affects the final preprocessing output.
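The order-sensitive, user-configurable pipeline described above can be sketched as follows; the step names and functions are invented for illustration and are not SPLIT's actual API:

```python
# A minimal sketch of an order-sensitive, configurable preprocessing
# pipeline; step names and functions are illustrative, not SPLIT's API.

def lowercase(text):
    return text.lower()

def strip_punct(text):
    # Keep only alphanumeric characters and whitespace.
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())

def collapse_spaces(text):
    return " ".join(text.split())

STEPS = {
    "lowercase": lowercase,
    "strip_punct": strip_punct,
    "collapse_spaces": collapse_spaces,
}

def preprocess(text, step_names):
    """Apply the user-selected steps in the user-specified order."""
    for name in step_names:
        text = STEPS[name](text)
    return text

print(preprocess("Hello,  World!", ["lowercase", "strip_punct", "collapse_spaces"]))
# hello world
```

Because the user controls both the selection and the order of steps, two researchers exchanging the same configuration obtain byte-identical preprocessing, which is the transparency the abstract argues for.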
ArchiMob - A Corpus of Spoken Swiss German
Tanja Samardzic, Yves Scherrer and Elvira Glaser
Swiss dialects of German are, unlike most dialects of well-standardised
languages, widely used in everyday communication.
Despite this fact, automatic processing of Swiss German is still a
considerable challenge because it is a mostly spoken, rarely
recorded variety that is subject to considerable
regional variation. This paper presents a freely available general-
purpose corpus of spoken Swiss German suitable for linguistic
research, but also for training automatic tools. The corpus is
a result of a long design process, intensive manual work and
specially adapted computational processing. We first describe how
the documents were transcribed, segmented and aligned with the
sound source, and how inconsistent transcriptions were unified
through an additional normalisation layer. We then present a
bootstrapping approach to automatic normalisation using different
machine-translation-inspired methods. Furthermore, we evaluate
the performance of part-of-speech taggers on our data and show
how the same bootstrapping approach improves part-of-speech
tagging by 10% over four rounds. Finally, we present the
modalities of access of the corpus as well as the data format.
Word Segmentation for Akkadian Cuneiform
Timo Homburg and Christian Chiarcos
We present experiments on word segmentation for Akkadian
cuneiform, an ancient writing system and language used for
about three millennia in the ancient Near East. To the best of our knowledge,
this is the first study of this kind applied to either the Akkadian
language or the cuneiform writing system. As a logosyllabic
writing system, cuneiform structurally resembles Eastern Asian
writing systems, so we employ word segmentation algorithms
originally developed for Chinese and Japanese. We describe
results of rule-based algorithms, dictionary-based algorithms,
statistical and machine learning approaches. Our results suggest
promising steps in cuneiform word segmentation that can enable
and improve natural language processing in this area.
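One of the dictionary-based algorithms adapted from Chinese segmentation, greedy maximum matching, can be sketched as follows; the sign lexicon below is invented for illustration and is not actual Akkadian data:

```python
# Greedy maximum matching, a classic dictionary-based segmentation
# baseline from Chinese NLP; the sign lexicon here is invented for
# illustration and is not actual Akkadian data.

def max_match(signs, lexicon, max_len=4):
    """Segment a sign sequence by always taking the longest lexicon entry."""
    words, i = [], 0
    while i < len(signs):
        for j in range(min(len(signs), i + max_len), i, -1):
            cand = tuple(signs[i:j])
            if cand in lexicon or j == i + 1:  # fall back to a single sign
                words.append(cand)
                i = j
                break
    return words

lexicon = {("a", "na"), ("bi", "it"), ("szar", "ru")}
print(max_match(["a", "na", "bi", "it"], lexicon))
# [('a', 'na'), ('bi', 'it')]
```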
Controlled Propagation of Concept Annotations in Textual Corpora
Cyril Grouin
In this paper, we present the annotation propagation tool
we designed to be used in conjunction with the BRAT rapid
annotation tool. We designed two experiments to annotate a
corpus of 60 files, first without our tool, then using our
propagation tool. We evaluated the annotation time and the quality
of the annotations. We show that using the annotation propagation
tool reduces the time spent annotating the corpus by 31.7% while
yielding results of better quality.
Graphical Annotation for Syntax-Semantics Mapping
Koiti Hasida
We discuss a potential work item (PWI) for an ISO standard (MAP)
on linguistic annotation concerning syntax-semantics mapping.
MAP is a framework for graphical linguistic annotation
to specify a mapping (set of combinations) between possible
syntactic and semantic structures of the annotated linguistic
data. Just like a UML diagram, a MAP diagram is formal,
in the sense that it accurately specifies such a mapping. MAP
provides a diagrammatic concrete syntax for linguistic
annotation that is far easier to understand than a textual concrete
syntax such as XML, so that it could better facilitate collaboration
among people involved in research, standardization, and practical
use of linguistic data. MAP deals with syntactic structures
including dependencies, coordinations, ellipses, transsentential
constructions, and so on. Semantic structures treated by MAP
are argument structures, scopes, coreferences, anaphora, discourse
relations, dialogue acts, and so forth. In order to simplify explicit
annotations, MAP allows partial descriptions, and assumes a few
general rules on correspondence between syntactic and semantic
compositions.
EDISON: Feature Extraction for NLP, Simplified
Mark Sammons, Christos Christodoulopoulos, Parisa Kordjamshidi, Daniel Khashabi, Vivek Srikumar and Dan Roth
When designing Natural Language Processing (NLP) applications
that use Machine Learning (ML) techniques, feature extraction
becomes a significant part of the development effort, whether
developing a new application or attempting to reproduce results
reported for existing NLP tasks. We present EDISON, a Java
library of feature generation functions used in a suite of state-of-
the-art NLP tools, based on a set of generic NLP data structures.
These feature extractors populate simple data structures encoding
the extracted features, which the package can also serialize to
an intuitive JSON file format that can be easily mapped to
formats used by ML packages. EDISON can also be used
programmatically with JVM-based (Java/Scala) NLP software to
provide the feature extractor input. The collection of feature
extractors is organised hierarchically and a simple search interface
is provided. In this paper we include examples that demonstrate
the versatility and ease-of-use of the EDISON feature extraction
suite to show that this can significantly reduce the time spent
by developers on feature extraction design for NLP systems.
The library is publicly hosted at https://github.com/
IllinoisCogComp/illinois-cogcomp-nlp/, and we
hope that other NLP researchers will contribute to the set of
feature extractors. In this way, the community can help simplify
reproduction of published results and the integration of ideas
from diverse sources when developing new and improved NLP
applications.
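The idea of feature generation functions whose output is serialized to a simple JSON format can be sketched as follows; the feature names are invented and do not reflect EDISON's actual Java API:

```python
# Sketch of feature generation functions whose output is a simple
# name -> value mapping serialized to JSON; the feature names are
# invented and do not reflect EDISON's actual Java API.
import json

def extract_features(tokens):
    """Toy feature extractors over a tokenized sentence."""
    return {
        "num_tokens": len(tokens),
        "has_capitalized": any(t[0].isupper() for t in tokens),
        "first_token": tokens[0].lower(),
    }

feats = extract_features(["EDISON", "simplifies", "feature", "extraction"])
# The JSON form is what an ML package would map to its own input format.
print(json.dumps(feats, sort_keys=True))
# {"first_token": "edison", "has_capitalized": true, "num_tokens": 4}
```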
P50 - Document Classification and Text Categorisation (2), Friday, May 27, 11:45
Chairperson: Thierry Hamon Poster Session
MADAD: A Readability Annotation Tool for Arabic Text
Nora Al-Twairesh, Abeer Al-Dayel, Hend Al-Khalifa, Maha Al-Yahya, Sinaa Alageel, Nora Abanmy and Nouf AlShenaifi
This paper introduces MADAD, a general-purpose annotation
tool for Arabic text with focus on readability annotation. This
tool will help overcome the lack of Arabic
readability training data by providing an online environment for
collecting readability assessments on various kinds of corpora.
The tool also supports a broad range of annotation tasks for various
linguistic and semantic phenomena by allowing users to create
their customized annotation schemes. MADAD is a web-based
tool, accessible through any web browser; the main features that
distinguish MADAD are its flexibility, portability, customizability
and its bilingual interface (Arabic/English).
Modeling Language Change in Historical Corpora: The Case of Portuguese
Marcos Zampieri, Shervin Malmasi and Mark Dras
This paper presents a number of experiments to model changes in
a historical Portuguese corpus composed of literary texts for the
purpose of temporal text classification. Algorithms were trained
to classify texts with respect to their publication date taking
into account lexical variation represented as word n-grams, and
morphosyntactic variation represented by part-of-speech (POS)
distribution. We report results of 99.8% accuracy using word
unigram features with a Support Vector Machines classifier to
predict the publication date of documents in time intervals of both
one century and half a century. A feature analysis is performed
to investigate the most informative features for this task and how
they are linked to language change.
“He Said She Said” – a Male/Female Corpus of Polish
Filip Gralinski, Łukasz Borchmann and Piotr Wierzchon
Gender differences in language use have long been of interest
in linguistics. The task of automatic gender attribution has
been considered in computational linguistics as well. Most
research of this type is done using (usually English) texts with
authorship metadata. In this paper, we propose a new method
of male/female corpus creation based on gender-specific first-
person expressions. The method was applied on CommonCrawl
Web corpus for Polish (language, in which gender-revealing first-
person expressions are particularly frequent) to yield a large
(780M words) and varied collection of men’s and women’s texts.
The whole procedure for building the corpus and filtering out
unwanted texts is described in the present paper. The quality
check was done on a random sample of the corpus to make sure
that the majority (84%) of texts are correctly attributed, natural
texts. Some preliminary (socio)linguistic insights (websites and
words frequently occurring in male/female fragments) are given
as well.
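The core idea, that Polish first-person past-tense verbs mark the speaker's gender morphologically (-łem masculine vs. -łam feminine), can be sketched with a two-suffix heuristic; the paper's actual method is more elaborate:

```python
# Sketch of gender attribution from Polish first-person past-tense
# morphology: -łem marks a male speaker, -łam a female speaker. This
# two-suffix heuristic is a simplification of the paper's method.
import re

MASC = re.compile(r"\b\w+łem\b")
FEM = re.compile(r"\b\w+łam\b")

def guess_gender(text):
    m, f = len(MASC.findall(text)), len(FEM.findall(text))
    if m > f:
        return "male"
    if f > m:
        return "female"
    return "unknown"

print(guess_gender("Wczoraj kupiłem nowy rower."))  # "Yesterday I bought a new bike."
# male
```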
Cohere: A Toolkit for Local Coherence
Karin Sim Smith, Wilker Aziz and Lucia Specia
We describe COHERE, our coherence toolkit which incorporates
various complementary models for capturing and measuring
different aspects of text coherence. In addition to the traditional
entity grid model (Lapata, 2005) and graph-based metric
(Guinaudeau and Strube, 2013), we provide an implementation of
a state-of-the-art syntax-based model (Louis and Nenkova, 2012),
as well as an adaptation of this model which shows significant
performance improvements in our experiments. We benchmark
these models using the standard setting for text coherence:
original documents and versions of the document with sentences
in shuffled order.
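The entity grid model represents a text as a matrix of entity roles per sentence (S = subject, O = object, X = other, - = absent) and derives coherence features from the distribution of role transitions between adjacent sentences; a minimal sketch of that computation, independent of COHERE's actual implementation:

```python
# Minimal sketch of entity-grid transition features: each sentence
# assigns each entity a role (S = subject, O = object, X = other,
# "-" = absent), and coherence features are the relative frequencies
# of role transitions between adjacent sentences.
from collections import Counter

def transition_distribution(grid):
    """grid: one role list per sentence, one column per entity."""
    counts = Counter()
    for prev, cur in zip(grid, grid[1:]):
        for a, b in zip(prev, cur):
            counts[(a, b)] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

# Two entities tracked across three sentences:
grid = [["S", "-"], ["O", "S"], ["-", "S"]]
dist = transition_distribution(grid)
print(dist[("S", "O")])
# 0.25
```

Shuffling the sentences changes this transition distribution, which is what makes it usable in the original-vs-shuffled evaluation setting described above.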
Multi-label Annotation in Scientific Articles - The Multi-label Cancer Risk Assessment Corpus
James Ravenscroft, Anika Oellrich, Shyamasree Saha and Maria Liakata
With the constant growth of the scientific literature, automated
processes to enable access to its contents are increasingly
in demand. Several functional discourse annotation schemes
have been proposed to facilitate information extraction and
summarisation from scientific articles, the most well known being
argumentative zoning. Core Scientific Concepts (CoreSC) is a
three-layered, fine-grained annotation scheme providing content-
based annotations at the sentence level and has been used
to index, extract and summarise scientific publications in the
biomedical literature. A previously developed CoreSC corpus
on which existing automated tools have been trained contains
a single annotation for each sentence. However, it is the case
that more than one CoreSC concept can appear in the same
sentence. Here, we present the Multi-CoreSC CRA corpus, a
text corpus specific to the domain of cancer risk assessment
(CRA), consisting of 50 full text papers, each of which contains
sentences annotated with one or more CoreSCs. The full text
papers have been annotated by three biology experts. We
present several inter-annotator agreement measures appropriate
for multi-label annotation assessment. Employing several inter-
annotator agreement measures, we were able to identify the most
reliable annotator and we built a harmonised consensus (gold
standard) from the three different annotators, while also taking
concept priority (as specified in the guidelines) into account. We
also show that the new Multi-CoreSC CRA corpus allows us
to improve performance in the recognition of CoreSCs. The
updated guidelines, the multi-label CoreSC CRA corpus and other
relevant, related materials are available at the time of publication
at http://www.sapientaproject.com/.
Detecting Expressions of Blame or Praise in Text
Udochukwu Orizu and Yulan He
The growth of social networking platforms has drawn a lot of
attention to the need for social computing. Social computing
utilises human insights for computational tasks as well as design
of systems that support social behaviours and interactions. One
of the key aspects of social computing is the ability to attribute
responsibility such as blame or praise to social events. This
ability helps an intelligent entity account and understand other
intelligent entities’ social behaviours, and enriches both the social
functionalities and cognitive aspects of intelligent agents. In
this paper, we present an approach and a model for blame and
praise detection in text. We build our model on various
theories of blame and include features that humans use when
determining judgment, such as moral agent causality,
foreknowledge, intentionality and coercion. An annotated corpus
has been created for the task of blame and praise detection from
text. The experimental results show that while our model gives
similar results compared to supervised classifiers on classifying
text as blame, praise or others, it outperforms supervised
classifiers on the finer-grained classification of determining the
direction of blame and praise, i.e., self-blame, blame-others, self-
praise or praise-others, despite not using labelled training data.
Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource
Stephan Tulkens, Chris Emmery and Walter Daelemans
Word embeddings have recently seen a sharp increase in interest
as a result of strong performance gains on a variety of tasks.
However, most of this research also underlined the importance of
benchmark datasets, and the difficulty of constructing these for
a variety of language-specific tasks. Still, many of the datasets
used in these tasks could prove to be fruitful linguistic resources,
allowing for unique observations into language use and variability.
In this paper we demonstrate the performance of multiple types
of embeddings, created with both count and prediction-based
architectures on a variety of corpora, in two language-specific
tasks: relation evaluation, and dialect identification. For the
latter, we compare unsupervised methods with a traditional,
hand-crafted dictionary. With this research, we provide the
embeddings themselves, the relation evaluation task benchmark
for use in further research, and demonstrate how the benchmarked
embeddings prove a useful unsupervised linguistic resource,
effectively used in a downstream task.
SatiricLR: a Language Resource of Satirical News Articles
Alice Frain and Sander Wubben
In this paper we introduce the Satirical Language Resource:
a dataset containing a balanced collection of satirical and
non-satirical news texts from various domains. This is the first dataset
of this magnitude and scope in the domain of satire. We envision
that this dataset will facilitate studies on various aspects of satire
in news articles. We test the viability of our data on the task of
satire classification.
P51 - Multilingual Corpora, Friday, May 27, 11:45
Chairperson: Penny Labropoulou Poster Session
WIKIPARQ: A Tabulated Wikipedia Resource Using the Parquet Format
Marcus Klang and Pierre Nugues
Wikipedia has become one of the most popular resources in
natural language processing and it is used in quantities of
applications. However, Wikipedia requires a substantial pre-
processing step before it can be used. For instance, its set of
nonstandardized annotations, referred to as the wiki markup, is
language-dependent and needs specific parsers from language
to language, for English, French, Italian, etc. In addition, the
intricacies of the different Wikipedia resources (main article
text, categories, wikidata, infoboxes), scattered across the article
document or in different files, make it difficult to get a global view
of this outstanding resource. In this paper, we describe WikiParq,
a unified format based on the Parquet standard to tabulate and
package the Wikipedia corpora. In combination with Spark, a
map-reduce computing framework, and the SQL query language,
WikiParq makes it much easier to write database queries to extract
specific information or subcorpora from Wikipedia, such as all
the first paragraphs of the articles in French, or all the articles
on persons in Spanish, or all the articles on persons that have
versions in French, English, and Spanish. WikiParq is available
in six language versions and is potentially extendible to all the
languages of Wikipedia. The WikiParq files are downloadable as
tarball archives from this location: http://semantica.cs.
lth.se/wikiparq/.
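The kind of query WikiParq enables, e.g. "all the first paragraphs of the articles in French", can be approximated over tabulated rows as below; the row schema is invented for illustration and is not WikiParq's actual format:

```python
# An invented approximation of querying tabulated Wikipedia rows;
# this is not WikiParq's actual schema. With Spark, the same
# selection would be an SQL query over the Parquet files, roughly:
#   SELECT text FROM wikiparq
#   WHERE lang = 'fr' AND part = 'paragraph' AND idx = 1
rows = [
    {"lang": "fr", "article": "Paris", "part": "paragraph", "idx": 1,
     "text": "Paris est la capitale de la France."},
    {"lang": "fr", "article": "Paris", "part": "paragraph", "idx": 2,
     "text": "La ville est traversée par la Seine."},
    {"lang": "en", "article": "Paris", "part": "paragraph", "idx": 1,
     "text": "Paris is the capital of France."},
]

# "All the first paragraphs of the articles in French":
first_fr = [r["text"] for r in rows
            if r["lang"] == "fr" and r["part"] == "paragraph" and r["idx"] == 1]
print(first_fr)
# ['Paris est la capitale de la France.']
```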
EN-ES-CS: An English-Spanish Code-Switching Twitter Corpus for Multilingual Sentiment Analysis
David Vilares, Miguel A. Alonso and Carlos Gómez-Rodríguez
Code-switching texts are those that contain terms in two or
more different languages, and they appear increasingly often in
social media. The aim of this paper is to provide a resource
to the research community to evaluate the performance of
sentiment classification techniques on this complex multilingual
environment, proposing an English-Spanish corpus of tweets with
code-switching (EN-ES-CS CORPUS). The tweets are labeled
according to two well-known criteria used for this purpose:
SentiStrength and a trinary scale (positive, neutral and negative
categories). Preliminary work on the resource has already been
done, providing a set of baselines for the research community.
SemRelData – Multilingual Contextual Annotation of Semantic Relations between Nominals: Dataset and Guidelines
Darina Benikova and Chris Biemann
Semantic relations play an important role in linguistic knowledge
representation. Although their role is relevant in the context of
written text, there is no approach or dataset that makes use of
the contextuality of classic semantic relations beyond the boundary
of a single sentence. We present the SemRelData dataset, which contains
annotations of semantic relations between nominals in the context
of one paragraph. To be able to analyse the universality of this
context notion, the annotation was performed on a multi-lingual
and multi-genre corpus. To evaluate the dataset, it is compared
to large, manually created knowledge resources in the respective
languages. The comparison shows that knowledge bases not
only have coverage gaps; they also do not account for semantic
relations that are manifested in particular contexts only, yet still
play an important role for text cohesion.
A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity Detection
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab
In this paper we describe our effort to create a dataset for
the evaluation of cross-language textual similarity detection.
We present preexisting corpora and their limitations, and we
explain the various resources we gathered to overcome these
limitations and build our enriched dataset. The proposed dataset is
multilingual, includes cross-language alignment for different
granularities (from chunk to document), is based on both
parallel and comparable corpora and contains human and
machine translated texts. Moreover, it includes texts written
by multiple types of authors (from average to professionals).
With the obtained dataset, we conduct a systematic and
rigorous evaluation of several state-of-the-art cross-language
textual similarity detection methods. The evaluation results
are reviewed and discussed. Finally, dataset and scripts are
made publicly available on GitHub: http://github.com/
FerreroJeremy/Cross-Language-Dataset.
An Arabic-Moroccan Darija Code-Switched Corpus
Younes Samih and Wolfgang Maier
In this paper, we describe our effort in the development and
annotation of a large scale corpus containing code-switched data.
Until recently, very limited effort has been devoted to develop
computational approaches or even basic linguistic resources to
support research into the processing of Moroccan Darija.
Standard Test Collection for English-Persian Cross-Lingual Word Sense Disambiguation
Navid Rekabsaz, Serwah Sabetghadam, Mihai Lupu, Linda Andersson and Allan Hanbury
In this paper, we address the shortage of evaluation benchmarks
for the Persian (Farsi) language by creating and making available a
new benchmark for English to Persian Cross Lingual Word Sense
Disambiguation (CL-WSD). In creating the benchmark, we follow
the format of the SemEval 2013 CL-WSD task, such that the
introduced tools of the task can also be applied on the benchmark.
In fact, the new benchmark extends the SemEval-2013 CL-WSD
task to the Persian language.
FREME: Multilingual Semantic Enrichment with Linked Data and Language Technologies
Milan Dojchinovski, Felix Sasaki, Tatjana Gornostaja, Sebastian Hellmann, Erik Mannens, Frank Salliau, Michele Osella, Phil Ritchie, Giannis Stoitsis, Kevin Koidl, Markus Ackermann and Nilesh Chakraborty
In recent years, Linked Data and Language Technology
solutions have gained popularity. Nevertheless, their coupling in
real-world business is limited due to several issues. Existing products
and services are developed for a particular domain, can be used
only in combination with already-integrated datasets, or have
limited language coverage. In this paper, we present an
innovative solution, FREME, an open framework of e-Services
for multilingual and semantic enrichment of digital content. The
framework integrates six interoperable e-Services. We describe
the core features of each e-Service and illustrate their usage in
the context of four business cases: i) authoring and publishing; ii)
translation and localisation; iii) cross-lingual access to data; and
iv) personalised Web content recommendations. Business cases
drive the design and development of the framework.
Improving Bilingual Terminology Extraction from Comparable Corpora via Multiple Word-Space Models
Amir Hazem and Emmanuel Morin
There is a rich flora of word space models that have proven their
efficiency in many different applications including information
retrieval (Dumais, 1988), word sense disambiguation (Schutze,
1992), various semantic knowledge tests (Lund, 1995; Karlgren,
2001), and text categorization (Sahlgren, 2005). Based on
the assumption that each model captures some aspects of word
meanings and provides its own empirical evidence, we present
in this paper a systematic exploration of the principal corpus-
based word space models for bilingual terminology extraction
from comparable corpora. We find that, once we have identified
the best procedures, a very simple combination approach leads to
significant improvements compared to individual models.
MultiVec: a Multilingual and Multilevel Representation Learning Toolkit for NLP
Alexandre Berard, Christophe Servan, Olivier Pietquin and Laurent Besacier
We present MultiVec, a new toolkit for computing continuous
representations for text at different granularity levels (word-level
or sequences of words). MultiVec includes word2vec’s features,
paragraph vector (batch and online) and bivec for bilingual
distributed representations. MultiVec also includes different
distance measures between words and sequences of words. The
toolkit is written in C++ and is aimed at being fast (in the
same order of magnitude as word2vec), easy to use, and easy to
extend. It has been evaluated on several NLP tasks: the analogical
reasoning task, sentiment analysis, and crosslingual document
classification.
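The distance measures between words and word sequences mentioned above typically reduce to cosine similarity over (possibly averaged) vectors; a sketch of both, independent of MultiVec's C++ implementation:

```python
# Sketch of the standard distance measures in embedding toolkits:
# cosine similarity between word vectors, and sequence vectors built
# by averaging word vectors. The vectors below are toy values, not
# MultiVec output.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def sequence_vector(vectors):
    """Represent a word sequence as the average of its word vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

king, queen = [0.9, 0.1], [0.8, 0.2]
print(round(cosine(king, queen), 3))
# 0.991
```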
Creation of comparable corpora for English-Urdu, Arabic, Persian
Murad Abouammoh, Kashif Shah and Ahmet Aker
Statistical Machine Translation (SMT) relies on the availability of
rich parallel corpora. However, in the case of under-resourced
languages or some specific domains, parallel corpora are not
readily available. This leads to under-performing machine
translation systems in those sparse data settings. To overcome
the low availability of parallel resources the machine translation
community has recognized the potential of using comparable
resources as training data. However, most efforts have been
related to European languages and fewer to Middle Eastern
languages. In this study, we report comparable corpora created
from news articles pairing English with Arabic, Persian, and
Urdu. The data has been collected over a period of a year.
Furthermore, using English as a pivot language, comparable
corpora involving more than one language can be created, e.g.
English-Arabic-Persian, English-Arabic-Urdu, English-Urdu-Persian,
etc. Upon request the data can be provided for research purposes.
A Corpus of Native, Non-native and Translated Texts
Sergiu Nisioi, Ella Rabinovich, Liviu P. Dinu and Shuly Wintner
We describe a monolingual English corpus of original and
(human) translated texts, with an accurate annotation of speaker
properties, including the original language of the utterances and
the speaker’s country of origin. We thus obtain three sub-
corpora of texts reflecting native English, non-native English,
and English translated from a variety of European languages.
This dataset will facilitate the investigation of similarities and
differences between these kinds of sub-languages. Moreover,
it will facilitate a unified comparative study of translations and
language produced by (highly fluent) non-native speakers, two
closely-related phenomena that have only been studied in isolation
so far.
Orthographic and Morphological Correspondences between Related Slavic Languages as a Base for Modeling of Mutual Intelligibility
Andrea Fischer, Klara Jagrova, Irina Stenger, Tania Avgustinova, Dietrich Klakow and Roland Marti
In an intercomprehension scenario, typically a native speaker
of language L1 is confronted with output from an unknown,
but related language L2. In this setting, the degree to which
the receiver recognizes the unfamiliar words greatly determines
communicative success. Despite exhibiting great string-level
differences, cognates may be recognized very successfully if
the receiver is aware of regular correspondences which allow them to
transform the unknown word into its familiar form. Modeling
L1-L2 intercomprehension then requires the identification of all
the regular correspondences between languages L1 and L2. We
here present a set of linguistic orthographic correspondences
manually compiled from comparative linguistics literature along
with a set of statistically-inferred suggestions for correspondence
rules. In order to do statistical inference, we followed the
Minimum Description Length principle, which proposes to
choose those rules which are most effective at describing the
data. Our statistical model was able to reproduce most of our
linguistic correspondences (88.5% for Czech-Polish and 75.7%
for Bulgarian-Russian) and furthermore allowed us to easily identify
many more non-trivial correspondences which also cover aspects
of morphology.
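Applying such correspondence rules to map an unknown L2 word onto its familiar L1 form can be sketched as follows; the two rules shown (Czech ř → Polish rz, v → w) are genuine Czech-to-Polish orthographic correspondences, but a full rule set as in the paper would be far larger:

```python
# Sketch of applying orthographic correspondence rules so that a
# Polish reader can map unknown Czech word forms onto familiar
# Polish cognates. Only two genuine rules are shown; the paper's
# rule sets are far larger.
RULES = [("ř", "rz"), ("v", "w")]

def apply_rules(word, rules):
    """Apply each source -> target substitution in order."""
    for src, tgt in rules:
        word = word.replace(src, tgt)
    return word

print(apply_rules("řeka", RULES))  # Czech "river" -> Polish "rzeka"
print(apply_rules("voda", RULES))  # Czech "water" -> Polish "woda"
```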
Axolotl: a Web Accessible Parallel Corpus for Spanish-Nahuatl
Ximena Gutierrez-Vasques, Gerardo Sierra and Isaac Hernandez Pompa
This paper describes the project called Axolotl which comprises a
Spanish-Nahuatl parallel corpus and its search interface. Spanish
and Nahuatl are distant languages spoken in the same country.
Due to the scarcity of digital resources, several problems
arose when compiling this corpus: most of our sources were
non-digital books, we faced errors when digitizing them, and the
sentence alignment process was difficult, to mention just a few.
The documents of the parallel corpus are not homogeneous: they
were extracted from different sources and exhibit dialectal,
diachronic, and orthographic variation. Additionally, we present
a web search interface that allows users to query the whole
parallel corpus; the system is capable of retrieving the parallel
fragments that contain a word or phrase searched in either
language. To our knowledge, this is the first publicly available
Spanish-Nahuatl digital parallel corpus. We think that this resource can be useful
to develop language technologies and linguistic studies for this
language pair.
A Turkish-German Code-Switching Corpus
Özlem Çetinoglu
Bilingual communities often alternate between languages both
in spoken and written communication. One such community,
Germany residents of Turkish origin, produces Turkish-German
code-switching by heavily mixing the two languages at the
discourse, sentence, or word level. Code-switching in general, and Turkish-
German code-switching in particular, has been studied for a long
time from a linguistic perspective. Yet resources to study them
from a more computational perspective are limited due to either
small size or licence issues. In this work we contribute to the
solution of this problem with a corpus. We present a Turkish-German
code-switching corpus which consists of 1029 tweets, with a majority
of intra-sentential switches. We describe the different types of
code-switching we have observed in our collection and our
processing steps. The first step is data collection and filtering.
This is followed by manual tokenisation and normalisation. And
finally, we annotate data with word-level language identification
information. The resulting corpus is available for research
purposes.
Introducing the LCC Metaphor Datasets
Michael Mohler, Mary Brunson, Bryan Rink and Marc Tomlinson
In this work, we present the Language Computer Corporation
(LCC) annotated metaphor datasets, which represent the largest
and most comprehensive resource for metaphor research to date.
These datasets were produced over the course of three years by
a staff of nine annotators working in four languages (English,
Spanish, Russian, and Farsi). As part of these datasets, we
provide (1) metaphoricity ratings for within-sentence word pairs
on a four-point scale, (2) scored links to our repository of 114
source concept domains and 32 target concept domains, and
(3) ratings for the affective polarity and intensity of each pair.
Altogether, we provide 188,741 annotations in English (for 80,100
pairs), 159,915 annotations in Spanish (for 63,188 pairs), 99,740
annotations in Russian (for 44,632 pairs), and 137,186 annotations
in Farsi (for 57,239 pairs). In addition, we are providing a large
set of likely metaphors which have been independently extracted
by our two state-of-the-art metaphor detection systems but which
have not been analyzed by our team of annotators.
Creating a Large Multi-Layered Representational Repository of Linguistic Code Switched Arabic Data
Mona Diab, Mahmoud Ghoneim, Abdelati Hawwari, Fahad AlGhamdi, Nada AlMarwani and Mohamed Al-Badrashiny
We present our effort to create a large Multi-Layered
representational repository of Linguistic Code-Switched Arabic
data. The process involves developing clear annotation
standards and guidelines, streamlining the annotation process,
and implementing quality control measures. We used two main
protocols for annotation: in-lab gold annotation and
crowdsourced annotation. We developed a web-based annotation tool
to facilitate the management of the annotation process. The
current version of the repository contains a total of 886,252 tokens
that are tagged into one of sixteen code-switching tags. The data
exhibits code switching between Modern Standard Arabic and
Egyptian Dialectal Arabic representing three data genres: Tweets,
commentaries, and discussion fora. The overall Inter-Annotator
Agreement is 93.1%.
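The overall agreement figure can be understood as token-level observed agreement, the simplest measure consistent with reporting a single percentage; a minimal sketch (the toy tags below stand in for the sixteen code-switching tags and are not project data):

```python
def observed_agreement(ann_a, ann_b):
    """Fraction of tokens on which two annotators assign the same tag."""
    assert len(ann_a) == len(ann_b)
    matches = sum(a == b for a, b in zip(ann_a, ann_b))
    return matches / len(ann_a)

# Two annotators labelling five tokens (MSA vs. Egyptian dialect):
a = ["MSA", "MSA", "EGY", "MSA", "EGY"]
b = ["MSA", "EGY", "EGY", "MSA", "EGY"]
agreement = observed_agreement(a, b)  # 4 of 5 tokens match: 0.8
```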
Modelling a Parallel Corpus of French and French Belgian Sign Language
Laurence Meurant, Maxime Gobert and Anthony Cleve
The overarching objective underlying this research is to develop
an online tool, based on a parallel corpus of French Belgian
Sign Language (LSFB) and written Belgian French. This tool
is aimed at assisting various sets of tasks related to the comparison
of LSFB and French, to the benefit of general users, teachers
in bilingual schools, translators and interpreters, as well
as linguists. These tasks include (1) the comprehension of LSFB
or French texts, (2) the production of LSFB or French texts, (3)
the translation between LSFB and French in both directions and
(4) the contrastive analysis of these languages. The first step
of investigation aims at creating a unidirectional French-LSFB
concordancer, able to align a one- or multiple-word expression
from the French translated text with its corresponding expressions
in the videotaped LSFB productions. We aim at testing the
efficiency of this concordancer for the extraction of a dictionary of
meanings in context. In this paper, we will present the modelling
of the different data sources at our disposal and specifically the
way they interact with one another.
Building the Macedonian-Croatian Parallel Corpus
Ines Cebovic and Marko Tadic
In this paper we present the newly created parallel corpus of
two under-resourced languages, namely, Macedonian-Croatian
Parallel Corpus (mk-hr_pcorp) that has been collected during
2015 at the Faculty of Humanities and Social Sciences, University
of Zagreb. The mk-hr_pcorp is a unidirectional (mk→hr)
parallel corpus composed of synchronic fictional prose texts
received already in digital form with over 500 Kw in each
language. The corpus was sentence segmented and provides
39,735 aligned sentences. The alignment was done automatically
and then post-corrected manually. The order of the alignments
was shuffled, which enabled the corpus to be made available under
a CC-BY licence through META-SHARE. However, this prevents
research into language units above the sentence level.
Two Years of Aranea: Increasing Counts and Tuning the Pipeline
Vladimír Benko
The Aranea Project is targeted at the creation of a family of
Gigaword web corpora for a dozen languages that could be used for
teaching language- and linguistics-related subjects at Slovak
universities, as well as for research purposes in various areas of
linguistics. All corpora are built according to a standard
methodology and with the same set of tools for processing
and annotation, which – together with their standard size –
also makes them a valuable resource for translators and for
contrastive studies. All our corpora are freely available either
via a web interface or in source form in an annotated vertical format.
Quantitative Analysis of Gazes and Grounding Acts in L1 and L2 Conversations
Ichiro Umata, Koki Ijuin, Mitsuru Ishida, Moe Takeuchi and Seiichi Yamamoto
The listener’s gazing activities during utterances were analyzed
in a face-to-face three-party conversation setting. The function
of each utterance was categorized according to the Grounding
Acts defined by Traum (Traum, 1994) so that gazes during
utterances could be analyzed from the viewpoint of grounding
in communication (Clark, 1996). Quantitative analysis showed
that the listeners were gazing at the speakers more in the second
language (L2) conversation than in the native language (L1)
conversation during the utterances that added new pieces of
information, suggesting that they were using visual information
to compensate for their lack of linguistic proficiency in L2
conversation.
Multi-language Speech Collection for NIST LRE
Karen Jones, Stephanie Strassel, Kevin Walker, David Graff and Jonathan Wright
The Multi-language Speech (MLS) Corpus supports NIST’s
Language Recognition Evaluation series by providing new
conversational telephone speech and broadcast narrowband data
in 20 languages/dialects. The corpus was built with the intention
of testing system performance in distinguishing closely related
or confusable linguistic varieties, and careful
manual auditing of collected data was an important aspect of
this work. This paper lists the specific data requirements for
the collection and provides both a commentary on the rationale
for those requirements as well as an outline of the various steps
taken to ensure all goals were met as specified. LDC conducted
a large-scale recruitment effort involving the implementation of
candidate assessment and interview techniques suitable for hiring
a large contingent of telecommuting workers, and this recruitment
effort is discussed in detail. We also describe the telephone
and broadcast collection infrastructure and protocols, and provide
details of the steps taken to pre-process collected data prior to
auditing. Finally, annotation training, procedures and outcomes
are presented in detail.
P52 - Part of Speech Tagging (2)
Friday, May 27, 11:45
Chairperson: Piotr Bański
Poster Session
FlexTag: A Highly Flexible PoS Tagging Framework
Torsten Zesch and Tobias Horsmann
We present FlexTag, a highly flexible PoS tagging framework. In
contrast to monolithic implementations that can only be retrained
but not adapted otherwise, FlexTag enables users to modify the
feature space and the classification algorithm. Thus, FlexTag
makes it easy to quickly develop custom-made taggers exactly
fitting the research problem.
New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian
Nikola Ljubešić, Filip Klubička, Željko Agić and Ivo-Pavao Jazbec
In this paper we present newly developed inflectional lexicons
and manually annotated corpora of Croatian and Serbian. We
introduce hrLex and srLex - two freely available inflectional
lexicons of Croatian and Serbian - and describe the process of
building these lexicons, supported by supervised machine learning
techniques for lemma and paradigm prediction. Furthermore,
we introduce hr500k, a manually annotated corpus of Croatian,
500 thousand tokens in size. We showcase the three newly
developed resources on the task of morphosyntactic annotation of
both languages by using a recently developed CRF tagger. We
achieve the best results yet reported on the task for both languages,
beating the HunPos baseline trained on the same datasets by a
wide margin.
TGermaCorp – A (Digital) Humanities Resource for (Computational) Linguistics
Andy Luecking, Armin Hoenen and Alexander Mehler
TGermaCorp is a German text corpus whose primary sources
are collected from German literature texts which date from
the sixteenth century to the present. The corpus is intended
to represent its target language (German) in syntactic, lexical,
stylistic and chronological diversity. For this purpose, it is hand-
annotated on several linguistic layers, including POS, lemma,
named entities, multiword expressions, clauses, sentences and
paragraphs. In order to introduce TGermaCorp in comparison to
more homogeneous corpora of contemporary everyday language,
quantitative assessments of syntactic and lexical diversity are
provided. In this respect, TGermaCorp contributes to establishing
characterising features for resource descriptions, which is needed
for keeping track of a meaningful comparison of the ever-growing
number of natural language resources. The assessments confirm
the special role of proper names, whose propagation in text may
influence lexical and syntactic diversity measures in rather trivial
ways. TGermaCorp will be made available via hucompute.org.
The hunvec framework for NN-CRF-based sequential tagging
Katalin Pajkossy and Attila Zséder
In this work we present the open source hunvec framework
for sequential tagging, built upon Theano and Pylearn2. The
underlying statistical model, which connects linear CRFs with
neural networks, was used by Collobert and co-workers, and
several other researchers. For demonstrating the flexibility of
our tool, we describe a set of experiments on part-of-speech
and named-entity-recognition tasks, using English and Hungarian
datasets, where we modify both model and training parameters,
and illustrate the usage of custom features. Model parameters
we experiment with affect the vectorial word representations used
by the model: we apply different word vector initializations,
defined by Word2vec and GloVe embeddings, and enrich the
representation of words with vectors assigned to trigram features.
We extend the training methods by using their regularized (l2 and
dropout) versions. When testing our framework on a Hungarian
named entity corpus, we find that its performance reaches the
best published results on this dataset, with no need for language-
specific feature engineering. Our code is available at
http://github.com/zseder/hunvec.
A Large Scale Corpus of Gulf Arabic
Salam Khalifa, Nizar Habash, Dana Abdulrahim and Sara Hassan
Most Arabic natural language processing tools and resources
are developed to serve Modern Standard Arabic (MSA), which
is the official written language in the Arab World. Some
Dialectal Arabic varieties, notably Egyptian Arabic, have received
some attention lately and have a growing collection of resources
that include annotated corpora and morphological analyzers and
taggers. Gulf Arabic, however, lags behind in that respect. In
this paper, we present the Gumar Corpus, a large-scale corpus of
Gulf Arabic consisting of 110 million words from 1,200 forum
novels. We annotate the corpus for sub-dialect information at the
document level. We also present results of a preliminary study
in the morphological annotation of Gulf Arabic which includes
developing guidelines for a conventional orthography. The text
of the corpus is publicly browsable through a web interface we
developed for it.
UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing
Milan Straka, Jan Hajic and Jana Straková
Automatic natural language processing of large texts often
presents recurring challenges in multiple languages: even for
the most advanced tasks, the texts are first processed by basic
processing steps – from tokenization to parsing. We present
an extremely simple-to-use tool consisting of one binary and
one model (per language), which performs these tasks for
multiple languages without the need for any other external
data. UDPipe, a pipeline processing CoNLL-U-formatted files,
performs tokenization, morphological analysis, part-of-speech
tagging, lemmatization and dependency parsing for nearly all
treebanks of Universal Dependencies 1.2 (namely, the whole
pipeline is currently available for 32 out of 37 treebanks). In
addition, the pipeline is easily trainable with training data in
CoNLL-U format (and in some cases also with additional raw
corpora) and requires minimal linguistic knowledge on the users’
part. The training code is also released.
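The CoNLL-U format that UDPipe consumes and produces is plain text: ten tab-separated columns per token (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), comment lines starting with '#', and blank lines separating sentences. A minimal reader sketch (not part of UDPipe itself; the example sentence is invented):

```python
# A three-token toy sentence in CoNLL-U's tab-separated layout.
CONLLU = """\
1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_
2\tdog\tdog\tNOUN\tNN\t_\t3\tnsubj\t_\t_
3\tbarks\tbark\tVERB\tVBZ\t_\t0\troot\t_\t_
"""

def parse_conllu(text):
    """Return a list of sentences; each token is a dict of the 10 fields."""
    fields = ["id", "form", "lemma", "upos", "xpos",
              "feats", "head", "deprel", "deps", "misc"]
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():            # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):  # skip comment lines
            current.append(dict(zip(fields, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences

sents = parse_conllu(CONLLU)
```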
Exploiting Arabic Diacritization for High QualityAutomatic Annotation
Nizar Habash, Anas Shahrour and Muhamed Al-Khalil
We present a novel technique for Arabic morphological
annotation. The technique utilizes diacritization to produce
morphological annotations of quality comparable to human
annotators. Although Arabic text is generally written without
diacritics, diacritization is already available for large corpora
of Arabic text in several genres. Furthermore, diacritization
can be generated at a low cost for new text as it does not
require specialized training beyond what educated Arabic typists
know. The basic approach is to enrich the input to a state-
of-the-art Arabic morphological analyzer with word diacritics
(full or partial) to enhance its performance. When applied to
fully diacritized text, our approach produces annotations with an
accuracy of over 97% on lemma, part-of-speech, and tokenization
combined.
A Proposal for a Part-of-Speech Tagset for the Albanian Language
Besim Kabashi and Thomas Proisl
Part-of-speech tagging is a basic step in Natural Language
Processing that is often essential. Labeling the word forms of
a text with fine-grained word-class information adds new value
to it and can be a prerequisite for downstream processes like
a dependency parser. Corpus linguists and lexicographers also
benefit greatly from the improved search options that are available
with tagged data. The Albanian language has some properties that
pose difficulties for the creation of a part-of-speech tagset. In this
paper, we discuss those difficulties and present a proposal for a
part-of-speech tagset that can adequately represent the underlying
linguistic phenomena.
Using a Small Lexicon with CRFs Confidence Measure to Improve POS Tagging Accuracy
Mohamed Outahajala and Paolo Rosso
Like most languages that have only recently begun to be
investigated for Natural Language Processing (NLP) tasks,
Amazigh still suffers from a scarcity of annotated corpora
and linguistic tools and resources. The main aim of
this paper is to present a new part-of-speech (POS) tagger based
on a new Amazigh tag set (AMTS) composed of 28 tags. In line
with our goal we have trained Conditional Random Fields (CRFs)
to build a POS tagger for the Amazigh language. We have used
the 10-fold technique to evaluate and validate our approach. The
average accuracy over the 10 folds is 87.95%, and the best
single-fold result is 91.18%. In order to improve this result, we have gathered
a set of about 8k words with their POS tags. The collected
lexicon was used with CRFs confidence measure in order to have
a more accurate POS-tagger. Hence, we have obtained a better
performance of 93.82%.
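The 10-fold evaluation scheme mentioned above can be sketched as follows (index splitting only; the CRF training itself is out of scope and not reproduced here):

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# 100 annotated items give 10 folds of 10 test items each; per-fold
# accuracies would then be averaged, as in the 87.95% figure above.
splits = list(k_fold_indices(100, k=10))
```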
Learning from Within? Comparing PoS Tagging Approaches for Historical Text
Sarah Schulz and Jonas Kuhn
In this paper, we investigate unsupervised and semi-supervised
methods for part-of-speech (PoS) tagging in the context of
historical German text. We locate our research in the context
of Digital Humanities, where the non-canonical nature of the texts
poses problems for a Natural Language Processing world in
which tools are mainly trained on standard data; data deviating
from the norm requires tools adjusted to it. We explore
to what extent the availability of such training material and
related resources influences the accuracy of PoS tagging.
We investigate a variety of algorithms including neural nets,
conditional random fields and self-learning techniques in order
to find the best-fitted approach to tackle data sparsity. Although
methods using resources from related languages outperform
weakly supervised methods using just a few training examples,
we can still reach a promising accuracy with methods that abstain
from additional resources.
O45 - Lexicons: Wordnet and Framenet
Friday, May 27, 14:55
Chairperson: Dan Tufiș
Oral Session
Wow! What a Useful Extension! Introducing Non-Referential Concepts to Wordnet
Luís Morgado da Costa and Francis Bond
In this paper we present the ongoing efforts to expand the
depth and breadth of the Open Multilingual Wordnet coverage
by introducing two new classes of non-referential concepts to
wordnet hierarchies: interjections and numeral classifiers. The
lexical semantic hierarchy pioneered by Princeton Wordnet has
traditionally restricted its coverage to referential and contentful
classes of words, such as nouns, verbs, adjectives and adverbs.
Previous efforts to enrich wordnet resources include, for
example, the inclusion of pronouns, determiners
and quantifiers within their hierarchies. Following similar efforts,
and motivated by the ongoing semantic annotation of the NTU-
Multilingual Corpus, we decided that the four traditional classes
of words present in wordnets were too restrictive. Though
non-referential, interjections and classifiers possess interesting
semantics features that can be well captured by lexical resources
like wordnets. In this paper, we will further motivate our
decision to include non-referential concepts in wordnets and give
an account of the current state of this expansion.
SlangNet: A WordNet-like Resource for English Slang
Shehzaad Dhuliawala, Diptesh Kanojia and Pushpak Bhattacharyya
We present a WordNet-like structured resource for slang words and
neologisms on the internet. The dynamism of language is often
an indication that current language technology tools, trained on
today’s data, may not be able to process the language in the future.
Our resource could be (1) used to augment the WordNet, (2) used
in several Natural Language Processing (NLP) applications which
make use of noisy data on the internet like Information Retrieval
and Web Mining. Such a resource can also be used to distinguish
slang word senses from conventional word senses. To stimulate
similar innovations widely in the NLP community, we test the
efficacy of our resource for detecting slang using standard bag of
words Word Sense Disambiguation (WSD) algorithms (Lesk and
Extended Lesk) for English data on the internet.
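The Lesk approach mentioned above picks the sense whose gloss shares the most words with the context; a minimal sketch with invented toy glosses (not actual SlangNet or WordNet entries):

```python
def lesk(context_words, senses):
    """senses: mapping sense_id -> gloss string; return the sense whose
    gloss has the largest word overlap with the context."""
    context = {w.lower() for w in context_words}
    return max(senses,
               key=lambda s: len(context & set(senses[s].lower().split())))

# Invented toy glosses for two senses of "lit":
senses = {
    "lit_fire": "burning or on fire giving off light",
    "lit_slang": "slang for something exciting excellent or fun",
}
best = lesk("that party was so exciting and fun".split(), senses)
```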
Discovering Fuzzy Synsets from the Redundancy in Different Lexical-Semantic Resources
Hugo Gonçalo Oliveira and Fábio Santos
Although represented as such in wordnets, word senses are
not discrete. To handle word senses as fuzzy objects, we
exploit the graph structure of synonymy pairs acquired from
different sources to discover synsets where words have different
membership degrees that reflect confidence. Following this
approach, a wide-coverage fuzzy thesaurus was discovered from
a synonymy network compiled from seven Portuguese lexical-semantic
resources. Based on a crowdsourcing evaluation, we
can say that the quality of the obtained synsets is far from perfect
but, as expected of a confidence measure, it increases significantly
for higher cut-points on the membership and, at a certain point,
reaches a 100% correctness rate.
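The membership degrees can be read as confidence derived from redundancy: the more resources attest a synonymy pair, the higher its degree. A sketch of that idea (the resource names and synonymy pairs below are invented, not taken from the seven resources used in the paper):

```python
# Invented synonymy pairs from three hypothetical resources.
resources = {
    "res1": {("carro", "automóvel"), ("carro", "viatura")},
    "res2": {("carro", "automóvel")},
    "res3": {("carro", "automóvel"), ("carro", "viatura")},
}

def membership(pair, resources):
    """Fraction of resources attesting the (unordered) synonymy pair."""
    key = frozenset(pair)
    attested = sum(any(frozenset(p) == key for p in pairs)
                   for pairs in resources.values())
    return attested / len(resources)

m1 = membership(("carro", "automóvel"), resources)  # attested by 3 of 3
m2 = membership(("viatura", "carro"), resources)    # 2 of 3, order-insensitive
```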
The Hebrew FrameNet Project
Avi Hayoun and Michael Elhadad
We present the Hebrew FrameNet project, describe the
development and annotation processes and enumerate the
challenges we faced along the way. We have developed semi-
automatic tools to help speed the annotation and data collection
process. The resource currently covers 167 frames, 3,000 lexical
units and about 500 fully annotated sentences. We have started
training and testing automatic SRL tools on the seed data.
O46 - Digital Humanities
Friday, May 27, 14:55
Chairperson: Andreas Witt
Oral Session
An Open Corpus for Named Entity Recognition in Historic Newspapers
Clemens Neudecker
The availability of openly available textual datasets (“corpora”)
with highly accurate manual annotations (“gold standard”) of
named entities (e.g. persons, locations, organizations, etc.) is
crucial for the training and evaluation of named entity recognition
systems. Currently there are only a few such datasets available
on the web, and even fewer for texts containing historical spelling
variation. The production and subsequent release into the public
domain of four such datasets with 100 pages each for the
languages Dutch, French, German (including Austrian) as part
of the Europeana Newspapers project is expected to contribute
to the further development and improvement of named entity
recognition systems with a focus on historical content. This paper
describes how these datasets were produced and what challenges
were encountered in their creation, and reports on their final
quality and availability.
Ambiguity Diagnosis for Terms in Digital Humanities
Béatrice Daille, Evelyne Jacquey, Gaël Lejeune, Luis Felipe Melo and Yannick Toussaint
Among all research dedicated to terminology and word sense
disambiguation, little attention has been devoted to the ambiguity
of term occurrences. Even if a lexical unit is indeed a term of
the domain, it is not true, even in a specialised corpus, that
all its occurrences are terminological: some occurrences are
terminological and others are not. Thus, a global decision at the
corpus level about the terminological status of all occurrences of a
lexical unit would be erroneous. In this paper, we propose
three original methods to characterise the ambiguity of term
occurrences in the domain of social sciences for French. These
methods differently model the context of the term occurrences:
one is relying on text mining, the second is based on textometry,
and the last one focuses on text genre properties. The experimental
results show the potential of the proposed approaches and give an
opportunity to discuss their hybridisation.
Metrical Annotation of a Large Corpus of Spanish Sonnets: Representation, Scansion and Evaluation
Borja Navarro, María Ribes-Lafoz and Noelia Sánchez
In order to analyze metrical and semantic aspects of poetry in
Spanish with computational techniques, we have developed a
large corpus annotated with metrical information. In this paper
we will present and discuss the development of this corpus: the
formal representation of metrical patterns, the semi-automatic
annotation process based on a new automatic scansion system, the
main annotation problems, and the evaluation, in which an inter-
annotator agreement of 96% has been obtained. The corpus is
open and available.
Corpus Analysis based on Structural Phenomena in Texts: Exploiting TEI Encoding for Linguistic Research
Susanne Haaf
This paper poses the question of how linguistic corpus-based
research may be enriched by the exploitation of conceptual text
structures and layout as provided via TEI annotation. Examples
of possible areas of research and usage scenarios are provided
based on the German historical corpus of the Deutsches Textarchiv
(DTA) project, which has been consistently tagged in accordance
with the TEI Guidelines, more specifically with the DTA ›Base
Format‹ (DTABf). The paper shows that by including TEI-XML
structuring in corpus-based analyses, significant effects can be
observed for different linguistic phenomena, e.g. the development of
conceptual text structures themselves, the syntactic embedding
of terms in certain conceptual text structures, and phenomena of
language change which become apparent via the layout of a text.
The exemplary study carried out here shows some of the potential
for the exploitation of TEI annotation for linguistic research,
which might be kept in mind when making design decisions for
new corpora as well as when working with existing TEI corpora.
O47 - Text Mining and Information Extraction
Friday, May 27, 14:55
Chairperson: Gregory Grefenstette
Oral Session
Evaluating Entity Linking: An Analysis of Current Benchmark Datasets and a Roadmap for Doing a Better Job
Marieke van Erp, Pablo Mendes, Heiko Paulheim, Filip Ilievski, Julien Plu, Giuseppe Rizzo and Joerg Waitelonis
Entity linking has become a popular task in both natural language
processing and semantic web communities. However, we find that
the benchmark datasets for entity linking tasks do not accurately
evaluate entity linking systems. In this paper, we aim to chart
the strengths and weaknesses of current benchmark datasets and
sketch a roadmap for the community to devise better benchmark
datasets.
Studying the Temporal Dynamics of Word Co-occurrences: An Application to Event Detection
Daniel Preotiuc-Pietro, P. K. Srijith, Mark Hepple and Trevor Cohn
Streaming media provides a number of unique challenges for
computational linguistics. This paper studies the temporal
variation in word co-occurrence statistics, with application to
event detection. We develop a spectral clustering approach to find
groups of mutually informative terms occurring in discrete time
frames. Experiments on large datasets of tweets show that these
groups identify key real world events as they occur in time, despite
no explicit supervision. The performance of our method rivals
state-of-the-art methods for event detection on F-score, obtaining
higher recall at the expense of precision.
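The first step described above, gathering word co-occurrence statistics per discrete time frame, can be sketched as follows (toy tweets; the spectral clustering of mutually informative terms is not reproduced here):

```python
from collections import Counter
from itertools import combinations

# Toy (time_frame, text) tweets; real input would be large tweet streams.
tweets = [
    (0, "earthquake hits city"),
    (0, "earthquake damage in city"),
    (1, "election results announced"),
    (1, "election night results"),
]

def cooccurrence_by_frame(tweets):
    """Per time frame, count unordered word pairs co-occurring in a tweet."""
    frames = {}
    for frame, text in tweets:
        counts = frames.setdefault(frame, Counter())
        for pair in combinations(sorted(set(text.split())), 2):
            counts[pair] += 1
    return frames

frames = cooccurrence_by_frame(tweets)
# ("city", "earthquake") co-occurs twice in frame 0 but never in frame 1;
# this kind of temporal contrast is the signal the clustering exploits.
```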
Markov Logic Networks for Text Mining: A Qualitative and Empirical Comparison with Integer Linear Programming
Luis Gerardo Mojica de la Vega and Vincent Ng
Joint inference approaches such as Integer Linear Programming
(ILP) and Markov Logic Networks (MLNs) have recently been
successfully applied to many natural language processing (NLP)
tasks, often outperforming their pipeline counterparts. However,
MLNs are arguably much less popular among NLP researchers
than ILP. While NLP researchers who desire to employ these joint
inference frameworks do not necessarily have to understand their
theoretical underpinnings, it is imperative that they understand
which of them should be applied under what circumstances.
With the goal of helping NLP researchers better understand the
relative strengths and weaknesses of MLNs and ILP, we will
compare them along different dimensions of interest, such as
expressiveness, ease of use, scalability, and performance. To our
knowledge, this is the first systematic comparison of ILP and
MLNs on an NLP task.
Arabic Corpora for Credibility Analysis
Ayman Al Zaatari, Rim El Ballouli, Shady ELbassouni, Wassim El-Hajj, Hazem Hajj, Khaled Shaban, Nizar Habash and Emad Yahya
A significant portion of data generated on blogging and
microblogging websites is non-credible as shown in many recent
studies. To filter out such non-credible information, machine
learning can be deployed to build automatic credibility classifiers.
However, as is the case with most supervised machine learning
approaches, sufficiently large and accurate training data must
be available. In this paper, we focus on building a public Arabic
corpus of blogs and microblogs that can be used for credibility
classification. We focus on Arabic due to the recent popularity of
blogs and microblogs in the Arab World and due to the lack of any
such public corpora in Arabic. We discuss our data acquisition
approach and annotation process, provide rigid analysis on the
annotated data and finally report some results on the effectiveness
of our data for credibility classification.
O48 - Corpus Creation and Analysis
Friday, May 27, 14:55
Chairperson: Paul Rayson
Oral Session
Solving the AL Chicken-and-Egg Corpus and Model Problem: Model-free Active Learning for Phenomena-driven Corpus Construction
Dain Kaplan, Neil Rubens, Simone Teufel and Takenobu Tokunaga
Active learning (AL) is often used in corpus construction (CC)
for selecting “informative” documents for annotation. This is
ideal for focusing annotation efforts when all documents cannot
be annotated, but has the limitation that it is carried out in a
closed-loop, selecting points that will improve an existing model.
For phenomena-driven and exploratory CC, the lack of existing
models and of specific task(s) for using them makes traditional AL
inapplicable. In this paper we propose a novel method for model-
free AL that utilises characteristics of the phenomena to
select documents for annotation. The method can also supplement
traditional closed-loop AL-based CC to extend the utility of the
corpus created beyond a single task. We introduce our tool,
MOVE, and show its potential with a real world case-study.
QUEMDISSE? Reported speech in Portuguese
Cláudia Freitas, Bianca Freitas and Diana Santos
This paper presents some work on direct and indirect speech in
Portuguese using corpus-based methods: we report on a study
whose aim was to identify (i) Portuguese verbs used to introduce
reported speech and (ii) syntactic patterns used to convey reported
speech, in order to enhance the performance of a quotation
extraction system, dubbed QUEMDISSE?. In addition, (iii) we
present a Portuguese corpus annotated with reported speech, using
the lexicon and rules provided by (i) and (ii), and discuss the
process of their annotation and what was learned.
MEANTIME, the NewsReader Multilingual Event and Time Corpus
Anne-Lyse Minard, Manuela Speranza, Ruben Urizar, Begoña Altuna, Marieke van Erp, Anneleen Schoen and Chantal van Son
In this paper, we present the NewsReader MEANTIME corpus, a
semantically annotated corpus of Wikinews articles. The corpus
consists of 480 news articles, i.e. 120 English news articles and
their translations in Spanish, Italian, and Dutch. MEANTIME
contains annotations at different levels. The document-level
annotation includes markables (e.g. entity mentions, event
mentions, time expressions, and numerical expressions), relations
between markables (modeling, for example, temporal information
and semantic role labeling), and entity and event intra-document
coreference. The corpus-level annotation includes entity and
event cross-document coreference. Semantic annotation on the
English section was performed manually; for the annotation in
Italian, Spanish, and (partially) Dutch, a procedure was devised to
automatically project the annotations on the English texts onto the
translated texts, based on the manual alignment of the annotated
elements; this not only enabled us to speed up the annotation
process but also provided cross-lingual coreference. The English
section of the corpus was extended with timeline annotations for
the SemEval 2015 TimeLine shared task. The “First CLIN Dutch
Shared Task” at CLIN26 was based on the Dutch section, while
the EVALITA 2016 FactA (Event Factuality Annotation) shared
task, based on the Italian section, is currently being organized.
The ACQDIV Database: Min(d)ing the Ambient Language
Steven Moran
One of the most pressing questions in cognitive science remains
unanswered: what cognitive mechanisms enable children to learn
any of the world’s 7000 or so languages? Much discovery has
been made with regard to specific learning mechanisms in specific
languages; however, given the remarkable diversity of language
structures (Evans and Levinson 2009, Bickel 2014), the burning
question remains: what are the underlying processes that make
language acquisition possible, despite substantial cross-linguistic
variation in phonology, morphology, syntax, etc.? To investigate
these questions, a comprehensive cross-linguistic database of
longitudinal child language acquisition corpora from maximally
diverse languages has been built.
P53 - Dialogue (2)
Friday, May 27, 14:55
Chairperson: Thorsten Trippel
Poster Session
Summarizing Behaviours: An Experiment on the Annotation of Call-Centre Conversations
Morena Danieli, Balamurali A R, Evgeny Stepanov, Benoit Favre, Frederic Bechet and Giuseppe Riccardi
Annotating and predicting behavioural aspects in conversations is
becoming critical in the conversational analytics industry. In this
paper we look into inter-annotator agreement of agent behaviour
dimensions on two call center corpora. We find that the task can
be annotated consistently over time, but that subjectivity issues
impact the quality of the annotation. The reformulation of some
of the annotated dimensions is suggested in order to improve
agreement.
Survey of Conversational Behavior: Towards the Design of a Balanced Corpus of Everyday Japanese Conversation
Hanae Koiso, Tomoyuki Tsuchiya, Ryoko Watanabe, Daisuke Yokomori, Masao Aizawa and Yasuharu Den
In 2016, we set about building a large-scale corpus of everyday
Japanese conversation: a collection of conversations embedded
in naturally occurring activities in daily life. We will collect
more than 200 hours of recordings over six years, publishing
the corpus in 2022. To construct such a huge corpus, we
have conducted a pilot project, one of whose purposes is to
establish a corpus design for collecting various kinds of everyday
conversations in a balanced manner. For this purpose, we
conducted a survey of everyday conversational behavior, with
about 250 adults, in order to reveal how diverse our everyday
conversational behavior is and to build an empirical foundation
for corpus design. The questionnaire asked when, where, for how
long, with whom, and in what kind of activity informants were
engaged in conversations. We found that ordinary conversations
show the following tendencies: i) they mainly consist of chats,
business talks, and consultations; ii) in general, the number of
participants is small and the duration of the conversation is short;
iii) many conversations are conducted in private places such as
homes, as well as in public places such as offices and schools; and
iv) some questionnaire items are related to each other. This paper
describes an overview of this survey study, and then discusses how
to design a large-scale corpus of everyday Japanese conversation
on this basis.
A Multi-party Multi-modal Dataset for Focus of Visual Attention in Human-human and Human-robot Interaction
Kalin Stefanov and Jonas Beskow
This paper describes a data collection setup and a newly recorded
dataset. The main purpose of this dataset is to explore patterns
in the focus of visual attention of humans under three different
conditions: two humans involved in task-based interaction with a
robot; the same two humans involved in task-based interaction where
the robot is replaced by a third human; and a free three-party
human interaction. The dataset contains two parts: 6 sessions with
duration of approximately 3 hours and 9 sessions with duration
of approximately 4.5 hours. Both parts of the dataset are rich in
modalities and recorded data streams - they include the streams
of three Kinect v2 devices (color, depth, infrared, body and face
data), three high quality audio streams, three high resolution
GoPro video streams, touch data for the task-based interactions
and the system state of the robot. In addition, the second part
of the dataset introduces the data streams from three Tobii Pro
Glasses 2 eye trackers. The language of all interactions is English
and all data streams are spatially and temporally aligned.
Internet Argument Corpus 2.0: An SQL Schema for Dialogic Social Media and the Corpora to go with it
Rob Abbott, Brian Ecker, Pranav Anand and Marilyn Walker
Large scale corpora have benefited many areas of research in
natural language processing, but until recently, resources for
dialogue have lagged behind. Now, with the emergence of large
scale social media websites incorporating a threaded dialogue
structure, content feedback, and self-annotation (such as stance
labeling), there are valuable new corpora available to researchers.
In previous work, we released the INTERNET ARGUMENT
CORPUS, one of the first larger scale resources available for
opinion sharing dialogue. We now release the INTERNET
ARGUMENT CORPUS 2.0 (IAC 2.0) in the hope that others will
find it as useful as we have. The IAC 2.0 provides more data
than IAC 1.0 and organizes it using an extensible, repurposable
SQL schema. The database structure in conjunction with the
associated code facilitates querying from and combining multiple
dialogically structured data sources. The IAC 2.0 schema provides
support for forum posts, quotations, markup (bold, italic, etc.), and
various annotations, including Stanford CoreNLP annotations.
We demonstrate the generalizability of the schema by providing
code to import the ConVote corpus.
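As a rough illustration of what an extensible SQL schema for threaded dialogue might look like, the sketch below builds a toy version in SQLite (the table and column names are invented for this sketch and are not the released IAC 2.0 schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post (
    post_id   INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES post(post_id),  -- threaded reply structure
    author    TEXT,
    text      TEXT
);
CREATE TABLE quote (
    quote_id  INTEGER PRIMARY KEY,
    post_id   INTEGER REFERENCES post(post_id),  -- post containing the quote
    source_id INTEGER REFERENCES post(post_id),  -- post being quoted
    text      TEXT
);
CREATE TABLE annotation (      -- extensible layers, e.g. stance, CoreNLP
    post_id INTEGER REFERENCES post(post_id),
    layer   TEXT,
    value   TEXT
);
""")
conn.execute("INSERT INTO post VALUES (1, NULL, 'alice', 'Root post')")
conn.execute("INSERT INTO post VALUES (2, 1, 'bob', 'A reply')")
conn.execute("INSERT INTO annotation VALUES (2, 'stance', 'disagree')")

# Querying the dialogic structure: all direct replies to post 1.
replies = conn.execute("SELECT author FROM post WHERE parent_id = 1").fetchall()
```

Keeping annotations in a separate layered table is one way a schema can absorb new corpora and new annotation types without structural changes, which is the kind of repurposability the paper describes.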
Capturing Chat: Annotation and Tools for Multiparty Casual Conversation
Emer Gilmartin and Nick Campbell
Casual multiparty conversation is an understudied but very
common genre of spoken interaction, whose analysis presents a
number of challenges in terms of data scarcity and annotation.
We describe the annotation process used on the d64 and DANS
multimodal corpora of multiparty casual talk, which have been
manually segmented, transcribed, annotated for laughter and
disfluencies, and aligned using the Penn Aligner. We also describe
a visualization tool, STAVE, developed during the annotation
process, which allows long stretches of talk or indeed entire
conversations to be viewed, aiding preliminary identification of
features and patterns worthy of analysis. It is hoped that this tool
will be of use to other researchers working in this field.
P54 - LR Infrastructures and Architectures (2)
Friday, May 27, 14:55
Chairperson: Koiti Hasida
Poster Session
Privacy Issues in Online Machine Translation Services - European Perspective
Pawel Kamocki and Jim O’Regan
In order to develop its full potential, global communication
needs linguistic support systems such as Machine Translation
(MT). In the past decade, free online MT tools have become
available to the general public, and the quality of their output is
increasing. However, the use of such tools may entail various legal
implications, especially as far as processing of personal data is
concerned. This is even more evident if we take into account that
their business model is largely based on providing translation in
exchange for data, which can subsequently be used to improve the
translation model, but also for commercial purposes. The purpose
of this paper is to examine how free online MT tools fit in the
European data protection framework, harmonised by the EU Data
Protection Directive. The perspectives of both the user and the
MT service provider are taken into account.
Lin|gu|is|tik: Building the Linguist’s Pathway to Bibliographies, Libraries, Language Resources and Linked Open Data
Christian Chiarcos, Christian Fäth, Heike Renner-Westermann, Frank Abromeit and Vanya Dimitrova
This paper introduces a novel research tool for the field of
linguistics: The Lin|gu|is|tik web portal provides a virtual
library which offers scientific information on every linguistic
subject. It comprises selected internet sources and databases
as well as catalogues for linguistic literature, and addresses
an interdisciplinary audience. The virtual library is the most
recent outcome of the Special Subject Collection Linguistics
of the German Research Foundation (DFG), and also integrates
the knowledge accumulated in the Bibliography of Linguistic
Literature. In addition to the portal, we describe long-term goals
and prospects with a special focus on ongoing efforts regarding an
extension towards integrating language resources and Linguistic
Linked Open Data.
Towards a Language Service Infrastructure for Mobile Environments
Ngoc Nguyen, Donghui Lin, Takao Nakaguchi and Toru Ishida
Since mobile devices have feature-rich configurations and provide
diverse functions, the use of mobile devices combined with the
language resources of cloud environments is highly promising
for achieving a wide range of communication that goes beyond
current language barriers. However, there are mismatches
between using the resources of mobile devices and services in the
cloud, such as different communication protocols and different
input and output methods. In this paper, we propose a language
service infrastructure for mobile environments to combine these
services. The proposed language service infrastructure allows
users to use and mashup existing language resources on both cloud
environments and their mobile devices. Furthermore, it allows
users to flexibly use services in the cloud or services on mobile
devices in their composite service without implementing several
different composite services that have the same functionality. A
case study of Mobile Shopping Translation System using both a
service in the cloud (translation service) and services on mobile
devices (Bluetooth low energy (BLE) service and text-to-speech
service) is introduced.
Designing A Long Lasting Linguistic Project: The Case Study of ASIt
Maristella Agosti, Emanuele Di Buccio, Giorgio Maria Di Nunzio, Cecilia Poletto and Esther Rinke
In this paper, we discuss the requirements that a long-lasting
linguistic database should have in order to meet the needs of
linguists together with the aims of durability and data sharing.
In particular, we discuss the generalizability of the Syntactic Atlas
of Italy, a linguistic project that builds on a long-standing tradition
of collecting and analyzing linguistic corpora, to a more recent
project that focuses on the synchronic and diachronic analysis
of the syntax of Italian and Portuguese relative clauses. The
results that are presented are in line with the FLaReNet Strategic
Agenda that highlighted the most pressing needs for research
areas, such as Natural Language Processing, and presented a set of
recommendations for the development and progress of Language
resources in Europe.
Global Open Resources and Information for Language and Linguistic Analysis (GORILLA)
Damir Cavar, Malgorzata Cavar and Lwin Moe
The infrastructure Global Open Resources and Information for
Language and Linguistic Analysis (GORILLA) was created as
a resource that provides a bridge between disciplines such as
documentary, theoretical, and corpus linguistics, speech and
language technologies, and digital language archiving services.
GORILLA is designed as an interface between digital language
archive services and language data producers. It addresses various
problems of common digital language archive infrastructures.
At the same time it serves the speech and language technology
communities by providing a platform to create and share speech
and language data from low-resourced and endangered languages.
It hosts an initial collection of language models for speech and
natural language processing (NLP), and technologies or software
tools for corpus creation and annotation. GORILLA is designed to
address the Transcription Bottleneck in language documentation,
and, at the same time to provide solutions to the general Language
Resource Bottleneck in speech and language technologies. It
does so by facilitating the cooperation between documentary
and theoretical linguistics, and speech and language technologies
research and development, in particular for low-resourced and
endangered languages.
corpus-tools.org: An Interoperable Generic Software Tool Set for Multi-layer Linguistic Corpora
Stephan Druskat, Volker Gast, Thomas Krause and Florian Zipser
This paper introduces an open source, interoperable generic
software tool set catering for the entire workflow of creation,
migration, annotation, query and analysis of multi-layer linguistic
corpora. It consists of four components: Salt, a graph-based meta
model and API for linguistic data, the common data model for
the rest of the tool set; Pepper, a conversion tool and platform for
linguistic data that can be used to convert many different linguistic
formats into each other; Atomic, an extensible, platform-
independent multi-layer desktop annotation software for linguistic
corpora; ANNIS, a search and visualization architecture for multi-
layer linguistic corpora with many different visualizations and a
powerful native query language. The set was designed to solve the
following issues in a multi-layer corpus workflow: lossless data
transition between tools through a common data model generic
enough to allow for a potentially unlimited number of different
types of annotation; conversion capabilities for different linguistic
formats to cater for the processing of data from different sources
and/or with existing annotations; a high level of extensibility
to enhance the sustainability of the whole tool set; and analysis
capabilities encompassing corpus and annotation query alongside
multi-faceted visualizations of all annotation layers.
CommonCOW: Massively Huge Web Corpora from CommonCrawl Data and a Method to Distribute them Freely under Restrictive EU Copyright Laws
Roland Schäfer
In this paper, I describe a method of creating massively huge
web corpora from the CommonCrawl data sets and redistributing
the resulting annotations in a stand-off format. Current EU (and
especially German) copyright legislation categorically forbids
the redistribution of downloaded material without express prior
permission by the authors. Therefore, such stand-off annotations
(or other derivatives) are the only format in which European
researchers (like myself) are allowed to re-distribute the respective
corpora. In order to make the full corpora available to the
public despite such restrictions, the stand-off format presented
here allows anybody to locally reconstruct the full corpora with
the least possible computational effort.
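The stand-off idea can be sketched in a few lines (the record format here is hypothetical and much simpler than CommonCOW's actual scheme): only character offsets and labels are redistributed, and whoever re-obtains the source document can rebuild the annotated corpus locally.

```python
def reconstruct(source_text, standoff_records):
    """Rebuild annotated spans from stand-off records.

    standoff_records: (start, end, label) character offsets into the
    locally re-downloaded source document. The copyrighted text itself
    is never redistributed, only the offsets and labels.
    """
    return [(source_text[start:end], label)
            for start, end, label in standoff_records]

# The user fetches the original page themselves...
page = "Hello world"
# ...and applies the freely redistributable stand-off annotations.
tokens = reconstruct(page, [(0, 5, "UH"), (6, 11, "NN")])
```

The legal burden thus shifts entirely to the end user's local copy of the source data, while the distributed annotation files carry no copyrighted material.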
CLARIN-EL Web-based Annotation Tool
Ioannis Manousos Katakis, Georgios Petasis and Vangelis Karkaletsis
This paper presents a new Web-based annotation tool, the
“CLARIN-EL Web-based Annotation Tool”. Based on an existing
annotation infrastructure offered by the “Ellogon” language
engineering platform, this new tool transfers a large part of
Ellogon’s features and functionalities to a Web environment,
by exploiting the capabilities of cloud computing. This new
annotation tool is able to support a wide range of annotation
tasks, through user provided annotation schemas in XML. The
new annotation tool has already been employed in several
annotation tasks, including the annotation of arguments, which
is presented as a use case. The CLARIN-EL annotation tool
is compared to existing solutions along several dimensions and
features. Finally, future work includes the improvement of
integration with the CLARIN-EL infrastructure, and the inclusion
of features not currently supported, such as the annotation of
aligned documents.
Two Architectures for Parallel Processing of Huge Amounts of Text
Mathijs Kattenberg, Zuhaitz Beloki, Aitor Soroa, Xabier Artola, Antske Fokkens, Paul Huygen and Kees Verstoep
This paper presents two alternative NLP architectures to analyze
massive amounts of documents, using parallel processing. The
two architectures focus on different processing scenarios, namely
batch-processing and streaming processing. The batch-processing
scenario aims at optimizing the overall throughput of the
system, i.e., minimizing the overall time spent on processing all
documents. The streaming architecture aims to minimize the
time to process real-time incoming documents and is therefore
especially suitable for live feeds. The paper presents experiments
with both architectures, and reports the overall gain when they
are used for batch as well as for streaming processing. All the
software described in the paper is publicly available under free
licenses.
Publishing the Trove Newspaper Corpus
Steve Cassidy
The Trove Newspaper Corpus is derived from the National Library
of Australia’s digital archive of newspaper text. The corpus is a
snapshot of the NLA collection taken in 2015 to be made available
for language research as part of the Alveo Virtual Laboratory and
contains 143 million articles dating from 1806 to 2007. This
paper describes the work we have done to make this large corpus
available as a research collection, facilitating access to individual
documents and enabling large scale processing of the newspaper
text in a cloud-based environment.
New Developments in the LRE Map
Vladimir Popescu, Lin Liu, Riccardo Del Gratta, Khalid Choukri and Nicoletta Calzolari
In this paper we describe the new developments brought to
LRE Map, especially in terms of the user interface of the Web
application, of the searching of the information therein, and of
the data model updates. Thus, users now have several new search
facilities, such as faceted search and fuzzy textual search, they
can now register, log in and store search bookmarks for further
perusal. Moreover, the data model now includes the notion of
paper and author, which allows for linking the resources to the
scientific works. Also, users can now visualise author-provided
field values and normalised values. The normalisation has been
manual and enables a better grouping of the entries. Last but not
least, provisions have been made towards linked open data (LOD)
aspects, by exposing an RDF access point that allows querying the
authors, papers and resources. Finally, a complete technological
overhaul of the whole application has been undertaken, especially
in terms of the Web infrastructure and of the text search backend.
P55 - Large Projects and Infrastructures (2)
Friday, May 27, 14:55
Chairperson: Dieter Van Uytvanck
Poster Session
Hidden Resources – Strategies to Acquire and Exploit Potential Spoken Language Resources in National Archives
Jens Edlund and Joakim Gustafson
In 2014, the Swedish government tasked a Swedish agency, The
Swedish Post and Telecom Authority (PTS), with investigating
how to best create and populate an infrastructure for spoken
language resources (Ref N2014/2840/ITP). As a part of this work,
the Department of Speech, Music and Hearing at KTH Royal
Institute of Technology has taken inventory of existing potential
spoken language resources, mainly in Swedish national archives
and other governmental or public institutions. In this position
paper, key priorities, perspectives, and strategies that may be of
general, rather than Swedish, interest are presented. We discuss
broad types of potential spoken language resources available; to
what extent these resources are free to use; and, thirdly, the main
contribution: strategies to ensure the continuous acquisition of
spoken language resources in a manner that facilitates speech and
speech technology research.
The ELRA License Wizard
Valérie Mapelli, Vladimir Popescu, Lin Liu, Meritxell Fernández Barrera and Khalid Choukri
To allow an easy understanding of the various licenses that exist
for the use of Language Resources (ELRA’s, META-SHARE’s,
Creative Commons’, etc.), ELRA has developed a License
Wizard to help right-holders share/distribute their resources
under the appropriate license. It also aims to be exploited by
users to better understand the legal obligations that apply in
various licensing situations. The present paper elaborates on the
functionalities of this web configurator, which enables users to
select a number of legal features and obtain the license adapted
to their selection, to define which user licenses they would like
to use in order to distribute their Language Resources, and to
integrate the license terms into a Distribution Agreement that
could be proposed to ELRA or META-SHARE for further
distribution through the ELRA Catalogue of Language Resources.
legal feature selection can easily be reviewed to include other
features that may be relevant for other licenses. Integrating
contributions from other initiatives thus aims to be one of the
obvious next steps, with a special focus on CLARIN and Linked
Data experiences.
Review on the Existing Language Resources for Languages of France
Thibault Grouas, Valérie Mapelli and Quentin Samier
With the support of the DGLFLF, ELDA conducted an inventory
of existing language resources for the regional languages of
France. The main aim of this inventory was to assess the
exploitability of the identified resources within technologies.
A total of 2,299 Language Resources were identified. As a
second step, a deeper analysis of a set of three language groups
(Breton, Occitan, overseas languages) was carried out, along with
a focus on their exploitability within three technologies: automatic
translation, voice recognition/synthesis and spell checkers. The
survey was followed by the organisation of the TLRF2015
Conference which aimed to present the state of the art in the field
of the Technologies for Regional Languages of France. The next
step will be to activate the network of specialists built up during
the TLRF conference and to begin the organisation of a second
TLRF conference. Meanwhile, the French Ministry of Culture
continues its actions related to linguistic diversity and technology,
in particular through a project with Wikimedia France related to
contributions to Wikipedia in regional languages, the upcoming
new version of the “Corpus de la Parole” and the reinforcement of
the DGLFLF’s Observatory of Linguistic Practices.
Selection Criteria for Low Resource Language Programs
Christopher Cieri, Mike Maxwell, Stephanie Strassel and Jennifer Tracey
This paper documents and describes the criteria used to select
languages for study within programs that include low resource
languages, whether given that label or another similar one. It
focuses on five US common-task Human Language Technology
research and development programs in which the authors have
provided information or consulting related to the choice of
language. The paper does not describe the actual selection process,
which is the responsibility of program management and highly
specific to a program’s individual goals and context. Instead it
concentrates on the data and criteria that have been considered
relevant previously with the thought that future program managers
and their consultants may adapt these and apply them with
different prioritization to future programs.
Enhancing Cross-border EU E-commerce through Machine Translation: Needed Language Resources, Challenges and Opportunities
Meritxell Fernández Barrera, Vladimir Popescu, Antonio Toral, Federico Gaspari and Khalid Choukri
This paper discusses the role that statistical machine translation
(SMT) can play in the development of cross-border EU
e-commerce, by highlighting extant obstacles and identifying
relevant technologies to overcome them. In this sense, it firstly
proposes a typology of e-commerce static and dynamic textual
genres and it identifies those that may be more successfully
targeted by SMT. The specific challenges concerning the
automatic translation of user-generated content are discussed
in detail. Secondly, the paper highlights the risk of data
sparsity inherent to e-commerce and it explores the state-of-
the-art strategies to achieve domain adequacy via adaptation.
Thirdly, it proposes a robust workflow for the development of
SMT systems adapted to the e-commerce domain by relying
on inexpensive methods. Given the scarcity of user-generated
language corpora for most language pairs, the paper proposes to
obtain monolingual target-language data to train language models
and aligned parallel corpora to tune and evaluate MT systems by
means of crowdsourcing.
P56 - Semantics (2)
Friday, May 27, 14:55
Chairperson: Yoshihiko Hayashi
Poster Session
Nine Features in a Random Forest to Learn Taxonomical Semantic Relations
Enrico Santus, Alessandro Lenci, Tin-Shing Chiu, Qin Lu and Chu-Ren Huang
ROOT9 is a supervised system for the classification of hypernyms,
co-hyponyms and random words that is derived from the already
introduced ROOT13 (Santus et al., 2016). It relies on a Random
Forest algorithm and nine unsupervised corpus-based features.
We evaluate it with a 10-fold cross validation on 9,600 pairs,
equally distributed among the three classes and involving several
Parts-Of-Speech (i.e. adjectives, nouns and verbs). When all the
classes are present, ROOT9 achieves an F1 score of 90.7%, against
a baseline of 57.2% (vector cosine). When the classification is
binary, ROOT9 achieves the following results against the baseline:
hypernyms-co-hyponyms 95.7% vs. 69.8%, hypernyms-random
91.8% vs. 64.1%, and co-hyponyms-random 97.8% vs. 79.4%.
In order to compare the performance with the state-of-the-art,
we have also evaluated ROOT9 in subsets of the Weeds et al.
(2014) datasets, proving that it is in fact competitive. Finally, we
investigated whether the system learns the semantic relation or
simply learns the prototypical hypernyms, as claimed by Levy et
al. (2015). The second possibility seems to be the most likely,
even though ROOT9 can be trained on negative examples (i.e.,
switched hypernyms) to drastically reduce this bias.
What a Nerd! Beating Students and Vector Cosine in the ESL and TOEFL Datasets
Enrico Santus, Alessandro Lenci, Tin-Shing Chiu, Qin Lu and Chu-Ren Huang
In this paper, we claim that Vector Cosine – which is generally
considered one of the most efficient unsupervised measures
for identifying word similarity in Vector Space Models – can
be outperformed by a completely unsupervised measure that
evaluates the extent of the intersection among the most associated
contexts of two target words, weighting such intersection
according to the rank of the shared contexts in the dependency
ranked lists. This claim comes from the hypothesis that similar
words do not simply occur in similar contexts, but they share a
larger portion of their most relevant contexts compared to other
related words. To prove it, we describe and evaluate APSyn, a
variant of Average Precision that – independently of the adopted
parameters – outperforms the Vector Cosine and the co-occurrence
on the ESL and TOEFL test sets. In the best setting, APSyn
reaches 0.73 accuracy on the ESL dataset and 0.70 accuracy on
the TOEFL dataset, therefore beating the non-English US college
applicants (whose average, as reported in the literature, is 64.50%)
and several state-of-the-art approaches.
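The rank-weighted context intersection behind APSyn can be sketched along these lines (a simplified reading of the measure; the paper's exact weighting scheme and parameters may differ):

```python
def apsyn(contexts_a, contexts_b, n=100):
    """Score two words by the overlap of their top-n most associated
    contexts, weighting each shared context by the inverse of its
    average (1-based) rank in the two dependency-ranked lists."""
    rank_a = {c: r for r, c in enumerate(contexts_a[:n], start=1)}
    rank_b = {c: r for r, c in enumerate(contexts_b[:n], start=1)}
    shared = rank_a.keys() & rank_b.keys()
    return sum(1.0 / ((rank_a[c] + rank_b[c]) / 2.0) for c in shared)

# Words sharing their most relevant contexts score higher than words
# sharing only low-ranked ones.
score = apsyn(["eat", "cook", "buy"], ["eat", "cook", "sell"])  # 1.5
```

Unlike the vector cosine, which weighs all dimensions, this formulation only rewards overlap among each word's most salient contexts, directly encoding the hypothesis stated in the abstract.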
Assessing the Potential of Metaphoricity of Verbs Using Corpus Data
Marco Del Tredici and Nuria Bel
The paper investigates the relation between metaphoricity and
distributional characteristics of verbs, introducing POM, a corpus-
derived index that can be used to define the upper bound of
metaphoricity of any expression in which a given verb occurs.
The work moves from the observation that while some verbs
can be used to create highly metaphoric expressions, others
cannot. We conjecture that this fact is related to the number of
contexts in which a verb occurs and to the frequency of each
context. This intuition is modelled by introducing a method in
which each context of a verb in a corpus is assigned a vector
representation, and a clustering algorithm is employed to identify
similar contexts. Eventually, the Standard Deviation of the relative
frequency values of the clusters is computed and taken as the POM
of the target verb. We tested POM in two experimental settings,
obtaining accuracy values of 84% and 92%. Since we are
convinced, along with (Shutoff, 2015), that metaphor detection
systems should be concerned only with the identification of highly
metaphoric expressions, we believe that POM could be profitably
employed by these systems to a priori exclude expressions that,
due to the verb they include, can only have low degrees of
metaphoricity.
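The final step of the POM computation can be sketched as follows (the clustering of context vectors is assumed to have already produced a cluster id per occurrence; the function and variable names are ours):

```python
import statistics

def pom(cluster_ids):
    """Given the cluster assigned to each corpus occurrence of a verb,
    return the standard deviation of the clusters' relative frequencies.
    A verb concentrated in a few contexts yields a higher value than one
    spread evenly across many."""
    total = len(cluster_ids)
    counts = {}
    for c in cluster_ids:
        counts[c] = counts.get(c, 0) + 1
    relative = [n / total for n in counts.values()]
    return statistics.pstdev(relative)

# Skewed distribution (one dominant context) vs. a uniform one.
skewed, uniform = pom([0, 0, 0, 1]), pom([0, 1, 2, 3])  # 0.25 vs. 0.0
```

A uniform spread over contexts gives a standard deviation of zero, matching the intuition that such verbs have little room for highly metaphoric use.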
Semantic Relation Extraction with Semantic Patterns: Experiment on Radiology Reports
Mathieu Lafourcade and Lionel Ramadier
This work presents a practical system for indexing terms and
relations from French radiology reports, called IMAIOS. In this
paper, we present how semantic relations (causes, consequences,
symptoms, locations, parts...) between medical terms can be
extracted. For this purpose, we handcrafted some linguistic
patterns from a subset of our radiology report corpora. As
semantic patterns (e.g. de (of)) may be too general or ambiguous,
semantic constraints have been added. For instance, in the
phrase néoplasie du sein (neoplasm of breast), the system,
knowing neoplasm to be a disease and breast an anatomical
location, identifies the relation as being a location: neoplasm
r-lieu breast. An evaluation of the effect of semantic constraints
is proposed.
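The constraint mechanism can be sketched as below (the type lexicon and the pattern handling are drastically simplified and the names are ours; the real system draws on a large lexical-semantic network):

```python
# Hypothetical semantic-type lexicon standing in for the full network.
SEMANTIC_TYPE = {"néoplasie": "disease", "sein": "anatomy", "hiver": "time"}

def extract_location(phrase):
    """Apply the pattern 'X de/du Y', keeping a match only when the
    semantic constraint holds: X is a disease and Y an anatomical site.
    Returns the relation triple, or None when the constraint fails."""
    for sep in (" du ", " de la ", " de "):
        if sep in phrase:
            x, y = phrase.split(sep, 1)
            if (SEMANTIC_TYPE.get(x) == "disease"
                    and SEMANTIC_TYPE.get(y) == "anatomy"):
                return (x, "r-lieu", y)
    return None

extract_location("néoplasie du sein")   # constrained match: a location relation
extract_location("néoplasie de hiver")  # rejected: 'hiver' is not anatomical
```

The constraint is what disambiguates the overly general preposition pattern: the same surface pattern fires only when the argument types license the location reading.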
EVALution-MAN: A Chinese Dataset for the Training and Evaluation of DSMs
Liu Hongchao, Karl Neergaard, Enrico Santus and Chu-Ren Huang
Distributional semantic models (DSMs) are currently being used
in the measurement of word relatedness and word similarity. One
shortcoming of DSMs is that they do not provide a principled way
to discriminate different semantic relations. Several approaches
have been adopted that rely on annotated data either in the
training of the model or later in its evaluation. In this paper, we
introduce a dataset for training and evaluating DSMs on the
discrimination of semantic relations between words in Mandarin
Chinese. The construction of the dataset followed EVALution 1.0,
an English dataset for the training and evaluation of DSMs.
The dataset contains 360 relation pairs, distributed over five
different semantic relations: antonymy, synonymy, hypernymy,
meronymy and near-synonymy. All relation pairs were checked
manually to estimate their quality. The 360 word relation pairs
involve 373 relata, all of which were extracted and subsequently
manually tagged according to their semantic type. The relata’s
frequencies were calculated in a combined corpus of Sinica and
Chinese Gigaword. To the best of our knowledge, EVALution-MAN
is the first of its kind for Mandarin Chinese.
Towards Building Semantic Role Labeler for Indian Languages
Maaz Anwar and Dipti Sharma
We present a statistical system for identifying the semantic
relationships or semantic roles for two major Indian Languages,
Hindi and Urdu. Given an input sentence and a predicate/verb, the
system first identifies the arguments pertaining to that verb and
then classifies them into one of the semantic labels, which can be
DOER, THEME, LOCATIVE, CAUSE, PURPOSE, etc. The
system is based on two statistical classifiers trained on roughly
130,000 words for Urdu and 100,000 words for Hindi that were
hand-annotated with semantic roles under the PropBank project
for these two languages. Our system achieves an accuracy of
86% in identifying the arguments of a verb for Hindi and 75% for
Urdu. At the subsequent task of classifying the constituents into
their semantic roles, the Hindi system achieved 58% precision and
42% recall, whereas the Urdu system performed better and achieved
83% precision and 80% recall. Our study also allowed us to
compare the usefulness of different linguistic features and feature
combinations in the semantic role labeling task. We also examine
the use of statistical syntactic parsing as a feature in the role labeling
task.
A Framework for Automatic Acquisition of Croatian and Serbian Verb Aspect from Corpora
Tanja Samardzic and Maja Milicevic
Verb aspect is a grammatical and lexical category that encodes
temporal unfolding and duration of events described by verbs.
It is a potentially interesting source of information for various
computational tasks, but has so far not been studied in much depth
from the perspective of automatic processing. Slavic languages
are particularly interesting in this respect, as they encode aspect
through complex and not entirely consistent lexical derivations
involving prefixation and suffixation. Focusing on Croatian and
Serbian, in this paper we propose a novel framework for automatic
classification of their verb types into a number of fine-grained
aspectual classes based on the observable morphology of verb
forms. In addition, we provide a set of around 2000 verbs
classified based on our framework. This set can be used for
linguistic research as well as for testing automatic classification
on a larger scale. With minor adjustments the approach is also
applicable to other Slavic languages.
Monolingual Social Media Datasets for Detecting Contradiction and Entailment
Piroska Lendvai, Isabelle Augenstein, Kalina Bontcheva and Thierry Declerck
Entailment recognition approaches are useful for application
domains such as information extraction, question answering
or summarisation, for which evidence from multiple sentences
needs to be combined. We report on a new 3-way judgement
Recognizing Textual Entailment (RTE) resource that originates in
the Social Media domain, and explain our semi-automatic creation
method for the special purpose of information verification, which
draws on manually established rumourous claims reported during
crisis events. From about 500 English tweets related to 70 unique
claims we compile and evaluate 5.4k RTE pairs, while continuing
to automate the workflow to generate similar-sized datasets in
other languages.
VoxML: A Visualization Modeling Language
James Pustejovsky and Nikhil Krishnaswamy
We present the specification for a modeling language, VoxML,
which encodes semantic knowledge of real-world objects
represented as three-dimensional models, and of events and
attributes related to and enacted over these objects. VoxML is
intended to overcome the limitations of existing 3D visual markup
languages by allowing for the encoding of a broad range of
semantic knowledge that can be exploited by a variety of systems
and platforms, leading to multimodal simulations of real-world
scenarios using conceptual objects that represent their semantic
values.
Metonymy Analysis Using Associative Relations between Words
Takehiro Teraoka
Metonymy is a figure of speech in which one item’s name
represents another item that usually has a close relation with the
first one. Metonymic expressions need to be correctly detected
and interpreted because sentences including such expressions
have different meanings from literal ones; computer systems
may output inappropriate results in natural language processing.
In this paper, an associative approach for analyzing metonymic
expressions is proposed. By using associative information
and two conceptual distances between words in a sentence, a
previous method is enhanced and a decision tree is trained to
detect metonymic expressions. After detecting these expressions,
they are interpreted as metonymic understanding words by
using associative information. This method was evaluated by
comparing it with two baseline methods based on previous
studies on the Japanese language that used case frames and
co-occurrence information. As a result, the proposed method
exhibited significantly better accuracy (0.85) in determining
words as metonymic or literal expressions than the baselines. It
also exhibited better accuracy (0.74) in interpreting the detected
metonymic expressions than the baselines.
Embedding Open-domain Common-sense Knowledge from Text
Travis Goodwin and Sanda Harabagiu
Our ability to understand language often relies on common-sense
knowledge – background information the speaker can assume
is known by the reader. Similarly, our comprehension of the
language used in complex domains relies on access to domain-
specific knowledge. Capturing common-sense and domain-
specific knowledge can be achieved by taking advantage of recent
advances in open information extraction (IE) techniques and,
more importantly, of knowledge embeddings, which are multi-
dimensional representations of concepts and relations. Building
a knowledge graph for representing common-sense knowledge
in which concepts discerned from noun phrases are cast as
vertices and lexicalized relations are cast as edges leads to
learning the embeddings of common-sense knowledge accounting
for semantic compositionality as well as implied knowledge.
Common-sense knowledge is acquired from a vast collection of
blogs and books as well as from WordNet. Similarly, medical
knowledge is learned from two large sets of electronic health
records. The evaluation results of these two forms of knowledge
are promising: the same knowledge acquisition methodology
based on learning knowledge embeddings works well both
for common-sense knowledge and for medical knowledge.
Interestingly, the common-sense knowledge that we have acquired
was evaluated as being less neutral than the medical
knowledge, as it often reflected the opinion of the knowledge
utterer. In addition, the acquired medical knowledge was
evaluated as more plausible than the common-sense knowledge,
reflecting the complexity of acquiring common-sense knowledge
due to the pragmatics and economicity of language.
Medical Concept Embeddings via Labeled Background Corpora
Eneldo Loza Mencía, Gerard de Melo and Jinseok Nam
In recent years, we have seen an increasing amount of interest in
low-dimensional vector representations of words. Among other
things, these facilitate computing word similarity and relatedness
scores. The best-known examples of algorithms that produce
representations of this sort are the word2vec approaches. In
this paper, we investigate a new model to induce such vector
spaces for medical concepts, based on a joint objective that
exploits not only word co-occurrences but also manually labeled
documents, as available from sources such as PubMed. Our
extensive experimental analysis shows that our embeddings lead
to significantly higher correlations with human similarity and
relatedness assessments than previous work. Due to the simplicity
and versatility of vector representations, these findings suggest
that our resource can easily be used as a drop-in replacement
to improve any systems relying on medical concept similarity
measures.
Question-Answering with Logic Specific to Video Games
Corentin Dumont, Ran Tian and Kentaro Inui
We present a corpus and a knowledge database aiming at
developing Question-Answering in a new context, the open world
of a video game. We chose a popular game called ‘Minecraft’,
and created a QA corpus with a knowledge database related to
this game and the ontology of a meaning representation that
will be used to structure this database. We are interested in
the logic rules specific to the game, which may not exist in the
real world. The ultimate goal of this research is to build a QA
system that can answer natural language questions from players
by using inference on these game-specific logic rules. The QA
corpus is partially composed of online quiz questions and partially
composed of manually written variations of the most relevant
ones. The knowledge database is extracted from several wiki-
like websites about Minecraft. It is composed of unstructured
data, such as text, that will be structured using the meaning
representation we defined, and already structured data such as
infoboxes. A preliminary examination of the data shows that
players are asking creative questions about the game, and that the
QA corpus can be used for clustering verbs and linking them to
predefined actions in the game.
P57 - Speech Corpora and Databases (2)
Friday, May 27, 14:55
Chairperson: Satoshi Nakamura Poster Session
Mining the Spoken Wikipedia for Speech Data and Beyond
Arne Köhn, Florian Stegen and Timo Baumann
We present a corpus of time-aligned spoken data of Wikipedia
articles as well as the pipeline that allows generating such corpora
for many languages. There are initiatives to create and sustain
spoken Wikipedia versions in many languages and hence the data
is freely available, grows over time, and can be used for automatic
corpus creation. Our pipeline automatically downloads and aligns
this data. The resulting German corpus currently totals 293h of
audio, of which we align 71h in full sentences and another 86h of
sentences with some missing words. The English corpus consists
of 287h, for which we align 27h in full sentences and 157h with
some missing words. Results are publicly available.
A Corpus of Read and Spontaneous Upper Saxon German Speech for ASR Evaluation
Robert Herms, Laura Seelig, Stefanie Münch and Maximilian Eibl
In this paper we present a corpus named SXUCorpus, which
contains read and spontaneous speech of the Upper Saxon German
dialect. The data has been collected from eight archives of
local television stations located in the Free State of Saxony.
The recordings include broadcasted topics of news, economy,
weather, sport, and documentation from the years 1992 to 1996
and have been manually transcribed and labeled. In the paper,
we report the methodology of collecting and processing analog
audiovisual material, constructing the corpus and describe the
properties of the data. In its current version, the corpus is
available to the scientific community and is designed for automatic
speech recognition (ASR) evaluation with a development set and
a test set. We performed ASR experiments with the open-
source framework sphinx-4 including a configuration for Standard
German on the dataset. Additionally, we show the influence of
acoustic model and language model adaptation by the utilization
of the development set.
Parallel Speech Corpora of Japanese Dialects
Koichiro Yoshino, Naoki Hirayama, Shinsuke Mori, Fumihiko Takahashi, Katsutoshi Itoyama and Hiroshi G. Okuno
Clean speech data is necessary for spoken language processing;
however, there is no public Japanese dialect corpus collected
for speech processing. Parallel speech corpora of dialects are
also important because real dialects affect each other; however,
the existing data only includes noisy speech data of dialects
and their translations in the common language. In this paper, we
collected parallel speech corpora of Japanese dialects: 100 read
speech utterances from 25 dialect speakers and their phoneme
transcriptions. We recorded the speech of 5 common-language
speakers and 20 dialect speakers from 4 areas, with 5 speakers
per area. Each dialect speaker converted the same common-language
texts into their dialect and read them. The speech was recorded
with a close-talking microphone, for use in spoken language
processing (recognition, synthesis, pronunciation estimation). In
the experiments, the accuracies of an automatic speech recognition
(ASR) and a Kana-Kanji conversion (KKC) system were improved by
adapting the systems with the data.
The TYPALOC Corpus: A Collection of Various Dysarthric Speech Recordings in Read and Spontaneous Styles
Christine Meunier, Cecile Fougeron, Corinne Fredouille, Brigitte Bigi, Lise Crevier-Buchman, Elisabeth Delais-Roussarie, Laurianne Georgeton, Alain Ghio, Imed Laaridh, Thierry Legou, Claire Pillot-Loiseau and Gilles Pouchoulin
This paper presents the TYPALOC corpus of French Dysarthric
and Healthy speech and the rationale underlying its constitution.
The objective is to compare phonetic variation in the speech of
dysarthric vs. healthy speakers in different speech conditions
(read and unprepared speech). More precisely, we aim to
compare the extent, types and location of phonetic variation
within these different populations and speech conditions. The
TYPALOC corpus comprises a selection of 28 dysarthric
patients (three different pathologies) and 12 healthy control
speakers recorded while reading the same text and in a more
natural continuous speech condition. Each audio signal has been
segmented into Inter-Pausal Units. Then, the corpus has been
manually transcribed and automatically aligned. The alignment
has been corrected by an expert phonetician. Moreover, the corpus
benefits from an automatic syllabification and an Automatic
Detection of Acoustic Phone-Based Anomalies. Finally, in order
to interpret phonetic variations due to pathologies, a perceptual
evaluation of each patient has been conducted. Quantitative data
are provided at the end of the paper.
A Longitudinal Bilingual Frisian-Dutch Radio Broadcast Database Designed for Code-Switching Research
Emre Yilmaz, Maaike Andringa, Sigrid Kingma, Jelske Dijkstra, Frits Van der Kuip, Hans Van de Velde, Frederik Kampstra, Jouke Algra, Henk van den Heuvel and David van Leeuwen
We present a new speech database containing 18.5 hours of
annotated radio broadcasts in the Frisian language. Frisian is
mostly spoken in the province Fryslan and it is the second official
language of the Netherlands. The recordings are collected from
the archives of Omrop Fryslan, the regional public broadcaster
of the province Fryslan. The database covers almost a 50-year
time span. The native speakers of Frisian are mostly bilingual
and often code-switch in daily conversations due to the extensive
influence of the Dutch language. Considering the longitudinal
and code-switching nature of the data, an appropriate annotation
protocol has been designed and the data is manually annotated
with the orthographic transcription, speaker identities, dialect
information, code-switching details and background noise/music
information.
The SI TEDx-UM Speech Database: a new Slovenian Spoken Language Resource
Andrej Zgank, Mirjam Sepesy Maucec and Darinka Verdonik
This paper presents a new Slovenian spoken language resource
built from TEDx Talks. The speech database contains 242 talks
with a total duration of 54 hours. The annotation and transcription of
acquired spoken material was generated automatically, applying
acoustic segmentation and automatic speech recognition. The
development and evaluation subset was also manually transcribed
using the guidelines specified for the Slovenian GOS corpus.
The manual transcriptions were used to evaluate the quality of
unsupervised transcriptions. The average word error rate for
the SI TEDx-UM evaluation subset was 50.7%, with an out-of-
vocabulary rate of 24% and a language model perplexity of 390.
The unsupervised transcriptions contain 372k tokens, 32k of
which are distinct.
Speech Corpus Spoken by Young-old, Old-old and Oldest-old Japanese
Yurie Iribe, Norihide Kitaoka and Shuhei Segawa
We have constructed a new speech data corpus, using the
utterances of 100 elderly Japanese people, to improve speech
recognition accuracy of the speech of older people. Humanoid
robots are being developed for use in elder care nursing homes.
Interaction with such robots is expected to help maintain the
cognitive abilities of nursing home residents, as well as providing
them with companionship. In order for these robots to interact
with elderly people through spoken dialogue, a high performance
speech recognition system for speech of elderly people is needed.
To develop such a system, we recorded speech uttered by 100
elderly Japanese, most of whom live in nursing homes, with
an average age of 77.2 years. Previously, a seniors’ speech
corpus named S-JNAS was developed, but the average age of
its participants was 67.6 years, whereas the target age for
nursing home care is around 75 years old, much higher than that
of the S-JNAS samples. In this paper we compare our new corpus with
an existing Japanese read speech corpus, JNAS, which consists of
adult speech, and with the above mentioned S-JNAS, the senior
version of JNAS.
Polish Rhythmic Database – New Resources for Speech Timing and Rhythm Analysis
Agnieszka Wagner, Katarzyna Klessa and Jolanta Bachan
This paper reports on a new resource – the Polish Rhythmic
Database – and tools developed with the aim of investigating
timing phenomena and rhythmic structure in Polish, including,
inter alia, the effect of speaking style and tempo on timing
patterns, phonotactic and phrasal properties of speech rhythm
and stability of rhythm metrics. So far, 19 native and 12 non-
native speakers with different first languages have been recorded.
The collected speech data (5 h 14 min.) represents five different
speaking styles and five different tempi. For the needs of speech
corpus management, annotation and analysis, a database was
developed and integrated with Annotation Pro (Klessa et al., 2013,
Klessa, 2016). Currently, the database is the only resource for
Polish which allows for a systematic study of a broad range of
phenomena related to speech timing and rhythm. The paper
also introduces new tools and methods developed to facilitate the
database annotation and analysis with respect to various timing
and rhythm measures. In the end, the results of ongoing
research and first experimental results using the new resources are
reported and future work is sketched.
An Extension of the Slovak Broadcast News Corpus based on Semi-Automatic Annotation
Peter Viszlay, Ján Staš, Tomáš Koctúr, Martin Lojka and Jozef Juhár
In this paper, we introduce an extension of our previously released
TUKE-BNews-SK corpus based on a semi-automatic annotation
scheme. It firstly relies on the automatic transcription of the BN
data performed by our Slovak large vocabulary continuous speech
recognition system. The generated hypotheses are then manually
corrected and completed by trained human annotators. The
corpus is composed of 25 hours of fully-annotated spontaneous
and prepared speech. In addition, we have acquired 900 hours
of another BN data, part of which we plan to annotate semi-
automatically. We present a preliminary corpus evaluation that
gives very promising results.
Generating a Yiddish Speech Corpus, Forced Aligner and Basic ASR System for the AHEYM Project
Malgorzata Cavar, Damir Cavar, Dov-Ber Kerler and Anya Quilitzsch
To create automatic transcription and annotation tools for the
AHEYM corpus of recorded interviews with Yiddish speakers in
Eastern Europe we develop initial Yiddish language resources that
are used for adaptations of speech and language technologies. Our
project aims at the development of resources and technologies
that can make the entire AHEYM corpus and other Yiddish
resources more accessible not only to the community of Yiddish
speakers and linguists with language expertise, but also to historians
and experts from other disciplines or the general public. In
this paper we describe the rationale behind our approach, the
procedures and methods, and challenges that are not specific to
the AHEYM corpus, but apply to all documentary language data
that is collected in the field. To the best of our knowledge, this is
the first attempt to create a speech corpus and speech technologies
for Yiddish. This is also the first attempt to develop speech and
language technologies to transcribe and translate a large collection
of Yiddish spoken language resources.
163
Authors Index
AA R, Balamurali, 153Abad, Alberto, 4, 134, 138Abanmy, Nora, 142Abbas, Noorhan, 62Abbott, Rob, 154Abdelali, Ahmed, 12Abdulrahim, Dana, 148Abercrombie, Gavin, 20Abouammoh, Murad, 145Abouda, Lotfi, 131Abromeit, Frank, 154Acar, Elif Ahsen, 125Ackermann, Markus, 144Adda, Gilles, 49Adda-Decker, Martine, 112Adeel Nawab, Rao Muhammad, 63Adesam, Yvonne, 105Adolphs, Peter, 39Adouane, Wafia, 94Afantenos, Stergos, 36, 95Afli, Haithem, 34Aga, Rosa Tsegaye, 72Agic, Željko, 148Agirre, Eneko, 14, 30, 59, 97, 105Agnès, Frédéric, 144Agosti, Maristella, 155Ah-Pine, Julien, 81Aichinger, Philipp, 27Aizawa, Akiko, 132, 139Aizawa, Masao, 153Ajili, Moez, 25Akarun, Lale, 48Aker, Ahmet, 108, 145Akhtar, Md Shad, 94Al shargi, Faisal, 45Al Zaatari, Ayman, 152Al-Badrashiny, Mohamed, 140, 146Al-Dayel, Abeer, 142Al-Khalifa, Hend, 142Al-Khalil, Muhamed, 149Al-Sulaiti, Latifa, 62Al-Twairesh, Nora, 142Al-Yahya, Maha, 142Alageel, Sinaa, 142
Alagic, Domagoj, 58Alam, Firoj, 99Alba Castro, José Luis, 49Albogamy, Fahad, 52Aldabe, Itziar, 93Alegria, Iñaki, 14, 37, 77, 102Alex, Beatrice, 136Alghamdi, Ayman, 18, 62AlGhamdi, Fahad, 146Algra, Jouke, 162Alharbi, Ghada, 61Alhelbawy, Ayman, 56Alikaniotis, Dimitrios, 104AlMarwani, Nada, 146Almeida, Hayda, 19Almeida, José João, 93Alonso, Miguel A., 144Alqahtani, Sawsan, 126AlShenaifi, Nouf, 142Altuna, Begoña, 152Alvarez, Aitor, 106Alves, Ana, 118Aman, Frédéric, 48, 68Aman, Frederic, 52Amanova, Dilafruz, 4Amaral, Daniela, 71Amilevicius, Darius, 88Amitabh, Unnayan, 68Amsler, Michael, 100Anand, Pranav, 154Ananiadou, Sophia, 44, 63Andersson, Linda, 144Andersson, Marta, 61Andriamakaoly, Jérémy, 70Andringa, Maaike, 162Andrzejczuk, Anna, 91Anikina, Tatjana, 121Antoine, Jean-Yves, 131António Rodrigues, João, 21, 96Antonitsch, André, 71Antunes, Sandra, 112Anwar, Maaz, 82, 159Apidianaki, Marianna, 39Aranberri, Nora, 65, 102, 105Arauco, Alejandro, 78
164
Araujo, Lourdes, 35Arcan, Mihael, 2, 20Archer, Dawn, 91Ariga, Michiaki, 85Arimoto, Yoshiko, 75, 139Arndt, Natanael, 31Arndt, Timotheus, 31Aroyo, Lora, 41, 74, 137Arppe, Antti, 112Arsevska, Elena, 118Artola, Xabier, 140, 156Artstein, Ron, 71, 109Arzelus, Haritz, 106Asahara, Masayuki, 57Asano, Hisako, 22Asher, Nicholas, 36, 95Aslam, Saba, 28Asooja, Kartik, 15Athanasakou, Vasiliki, 63Attardi, Giuseppe, 58Attia, Mohammed, 124Atwell, Eric, 18, 62Auberge, Veronique, 52Aufrant, Lauriane, 53Augenstein, Isabelle, 159Augustinus, Liesbeth, 23, 123Auzina, Ilze, 27, 89Avgustinova, Tania, 145Avramidis, Eleftherios, 65Aziz, Wilker, 142Azpeitia, Andoni, 122
BBabych, Bogdan, 127Bachan, Jolanta, 162Baeza-Yates, Ricardo, 33Baisa, Vít, 29, 30, 97Balahur, Alexandra, 40Baldwin, Timothy, 10Balenciaga, Marina, 106Bali, Kalika, 57Banea, Carmen, 129Banjade, Rajendra, 42, 130Banski, Piotr, 98, 124Baptista, Jorge, 135Barackman, Casey, 36Barancikova, Petra, 123Barbagli, Alessia, 4Barbieri, Francesco, 137
Barbu Mititelu, Verginica, 87Bargmann, Sascha, 80Barker, Emma, 108Barras, Claude, 11, 49Barreaux, Sabine, 66Barreiro, Anabela, 44Bartie, Phil, 75Bartolini, Roberto, 88Bartosiak, Tomasz, 47, 91Barzdins, Guntis, 62Basile, Angelo, 98Basili, Roberto, 2Batanovic, Vuk, 93Bateman, Leila, 73Batista, Fernando, 134Batliner, Anton, 46Battistelli, Delphine, 131Baumann, Timo, 161Baumgartner Jr., William A., 97Baur, Claudia, 8Bayol, Clarisse, 52Bayyr-ool, Aziyana, 89Béchet, Frédéric, 36Bechet, Frederic, 153Becker, Alex, 50Bedjeti, Adriatik, 48Bedrick, Steven, 118Begum, Rafiya, 57Behera, Pitambar, 51Beijer, Lilian, 28Bejcek, Eduard, 18, 80Bekavac, Marko, 105Bekkadja, Slima, 48Bel, Núria, 31, 77, 96Bel, Nuria, 158Bell, Dane, 6, 103Bellot, Patrice, 126Beloki, Zuhaitz, 140, 156Beltrami, Daniela, 72Ben Abacha, Asma, 115Ben Jannet, Mohamed Ameur, 65Benikova, Darina, 144Benko, Vladimír, 147Bentivogli, Luisa, 122Bentz, Christian, 74Berard, Alexandre, 145Berkling, Kay, 111Bernard, Guillaume, 25
165
Bernotat, Jasmin, 120Bertero, Dario, 18Bertrand, Roxane, 76, 111Besacier, Laurent, 49, 133, 144, 145Besançon, Romaric, 67Beskow, Jonas, 153Bethard, Steven, 118, 131Betz, Simon, 61Bhat, Riyaz Ahmad, 82Bhattacharya, Pushpak, 81, 106Bhattacharyya, Pushpak, 22, 77, 94, 150Bhingardive, Sudha, 81, 106Biagioni, Stefania, 13Bianchi, Francesca, 91Bick, Eckhard, 37Biemann, Chris, 125, 144Bierkandt, Lennart, 132Bies, Ann, 32, 89, 129Bigenzahn, Wolfgang, 27Bigi, Brigitte, 76, 161Billawala, Youssef, 107Bingel, Joachim, 124Bittar, André, 136Bizer, Christian, 12Blache, Philippe, 54, 81Black, Alan W, 119Blain, Frédéric, 78Blanco, Eduardo, 132Bleicken, Julian, 115Bobillier Chaumon, Marc-Eric, 48Bod, Rens, 3Boella, Guido, 29Bogantes, Diana, 78Bonastre, Jean-françois, 25Bond, Francis, 84, 150Bonial, Claire, 137Bonneau, Anne, 46Bontcheva, Kalina, 40, 159Borchmann, Łukasz, 142Bordea, Georgeta, 15Boros, Tiberiu, 87Bosc, Tom, 43Bosco, Cristina, 56, 101Bott, Stefan, 79Bouakaz, Saïda, 48Bouamor, Dhouha, 80, 110Bouamor, Houda, 38, 64, 126Boudin, Florian, 66
Bougouin, Adrien, 66Bouhafs Hafsia, Asma, 41Bouma, Gerlof, 105Bourlon, Antoine, 76Bowden, Kevin, 36, 120Boye, Johan, 132Bozsahin, Cem, 125Braasch, Anna, 30Branco, António, 1, 21, 54, 96, 97, 105Brandes, Jasper, 100Brasoveanu, Adrian, 116Braunger, Patricia, 26Bredin, Hervé, 11, 49Brierley, Claire, 18, 62Bristot, Antonella, 71Broadwell, George Aaron, 39Brognaux, Sandrine, 134Brugman, Hennie, 44Brümmer, Martin, 116Bruneau, Pierrick, 11, 49Brunson, Mary, 146Buchner, Karolina, 107Budnik, Mateusz, 49Budzynska, Katarzyna, 135Buitelaar, Paul, 15, 20, 84Bunt, Harry, 110Burchardt, Aljoscha, 1, 65Burga, Alicia, 71Burghardt, Manuel, 70Burgos, Pepi, 111Burkhardt, Felix, 26Buscaldi, Davide, 128Busso, Lucia, 92Buttery, Paula, 74, 104
CCabeza-Pereiro, María del Carmen, 49Cabrio, Elena, 43Caines, Andrew, 74, 104Cajal, Sergio, 71Cakmak, Huseyin, 76Calixto, Iacer, 65Calvo, Arturo, 56Calzà, Laura, 72Calzolari, Nicoletta, 88, 156Camacho-Collados, José, 59Camelin, Nathalie, 11Camgöz, Necati Cihan, 48Campbell, Nick, 21, 119, 154
166
Campillos Llanos, Leonardo, 80, 110Campos, Marisa, 54Candeias, Sara, 27, 50Candito, Marie, 82, 131Capka, Tomáš, 88Cardeñoso-Payo, Valentín, 73Cardoso, Aida, 47Carl, Michael, 139Carlini, Roberto, 80Carlmeyer, Birte, 120Carlotto, Talvany, 140Carman, Mark James, 77Caroli, Frederico Tommasi, 72Carpenter, Jordan, 129Carrive, Jean, 70Carvalho, Paula, 44Caselli, Tommaso, 14, 41, 121, 137Cassidy, Steve, 156Castellucci, Giuseppe, 2Castilho, Sheila, 11Castillo, Carlos, 57Cavar, Damir, 138, 155, 163Cavar, Malgorzata, 138, 155, 163Cavazza, Marc, 60Cavicchio, Federica, 71Cebovic, Ines, 147Celebi, Arda, 104Celli, Fabio, 99Celorico, Dirce, 27Cermáková, Anna, 88Cerrato, Loredana, 21Çetinoglu, Özlem, 146Cettolo, Mauro, 122Chakrabarty, Abhisek, 89Chakraborty, Nilesh, 144Chalub, Fabricio, 31Chamberlain, Jon, 71Chanfreau, Agustin, 17Chang, Angel, 30Chang, Chung-Ning, 120Charlet, Delphine, 70Charnois, Thierry, 128Charton, Eric, 19Chaturvedi, Akshay, 89Chavernac, David, 118Che, Xiaoyin, 23Chen, Francine, 104Chen, Hsin-Hsi, 8, 36, 43
Chen, Huan-Yuan, 36Chen, Jiajun, 23Chen, Lei, 66Chen, Xi, 129Chen, Yan-Ying, 104Chen, Yun-Nung, 26, 109Cherry, Colin, 136Chiarcos, Christian, 51, 84, 141, 154Chiu, Billy, 133Chiu, Tin-Shing, 158Chlumská, Lucie, 88Cho, Kit, 39, 130Chodroff, Eleanor, 46Choi, Eunsol, 15Choi, Ho-Jin, 12Choi, Key-Sun, 12Cholakov, Kostadin, 1Chollet, Mathieu, 17Chorianopoulou, Arodami, 4Choudhury, Monojit, 57Choukri, Khalid, 16, 56, 156, 157Chowdhury, Shammur Absar, 5Christensen, Heidi, 69Christodoulopoulos, Christos, 141Chu, Chenhui, 22, 76, 102Cieri, Christopher, 16, 56, 73, 86, 87, 157Cimiano, Philipp, 84, 120, 121Cinkova, Silvie, 6, 29, 30, 138Ciobanu, Alina Maria, 114Ciravegna, Fabio, 78Claessen, Koen, 24Clare, Amanda, 60Claveau, Vincent, 41, 128Clematide, Simon, 34Cleve, Anthony, 147Cnossen, Fokie, 109Codina-Filba, Joan, 71Cohan, Arman, 28Cohen, K. Bretonnel, 97Coheur, Luisa, 10Cohn, Trevor, 151Collins, Kathryn J., 5Collovini, Sandra, 66, 71Colotte, Vincent, 46Conger, Kathryn, 32Cook, Paul, 10, 103Copestake, Ann, 43Corcoglioniti, Francesco, 31
167
Cordeiro, Silvio, 42Corrales-Astorgano, Mario, 73Correia, Rui, 10, 135Costa, Angela, 10Couillault, Alain, 55, 131Courtin, Antoine, 105Coutinho, Eduardo, 46Couto-Vale, Daniel, 125Crevier-Buchman, Lise, 161Croce, Danilo, 2Cruz, Hilaria, 138Cuadros, Montse, 3Cuba Gyllensten, Amaru, 12Cucchiarini, Catia, 28, 111Cucurullo, Sebastiana, 114Cunningham, Stuart, 69Curto, Pedro, 134Cvrcek, Václav, 88Cysouw, Michael, 83
Dda Costa Pereira, Célia, 10da Silva, João Carlos Pereira, 72Dabre, Raj, 102Daelemans, Walter, 56, 143Dagan, Ido, 109Dai, Xin-Yu, 23Daiber, Joachim, 23Daille, Béatrice, 41, 108, 151Daille, Beatrice, 66Damnati, Geraldine, 70Danforth, Douglas, 110Danieli, Morena, 153Daris, Roberts, 89Darwish, Kareem, 37Das, Amitava, 64Dash, Arnab, 109David, Jérôme, 84Dayrell, Carmen, 91de Carvalho, Rita, 54De Clercq, Orphee, 101de Juan, Paloma, 17De Kuthy, Kordula, 136de Marneffe, Marie-Catherine, 57de Melo, Gerard, 84, 160de Montcheuil, Gregoire, 54de Paiva, Valeria, 31de Ruiter, Laura, 61De Smedt, Koenraad, 80, 123
de Weerd, Harmen, 109Declerck, Thierry, 84, 159Dediu, Dan, 68Degaetano-Ortlieb, Stefania, 67Del Gratta, Riccardo, 88, 156del Pozo, Arantza, 106Del Tredici, Marco, 158Delais-Roussarie, Elisabeth, 161Deléglise, Paul, 36Dell’Orletta, Felice, 4Delli Bovi, Claudio, 59Demberg, Vera, 36Dembowski, Julia, 33Demir, Hakan, 19Demner-Fushman, Dina, 115, 130Demuynck, Kris, 133Den, Yasuharu, 153Denk-Linnert, Doris-Maria, 27Derczynski, Leon, 9, 128Derval, Mathieu, 70Deulofeu, José, 79DeVault, David, 5Dhuliawala, Shehzaad, 150Di Buccio, Emanuele, 155di Buono, Maria Pia, 25Di Caro, Luigi, 29Di Nunzio, Giorgio Maria, 155Diab, Mona, 124, 126, 140, 146Dias Cardoso, Pedro, 20Diaz, Alberto, 14Dick, Melanie, 99Dick, Michelle, 120Diewald, Nils, 124Dijkstra, Jelske, 162Dima, Emanuel, 86Dimitrov, Stefan, 7Dimitrova, Vanya, 154Dimou, Athanasia-Lida, 120Dinarelli, Marco, 20Dini, Luca, 136Dinu, Liviu P., 114, 117, 145DiPersio, Denise, 56, 86Dirix, Peter, 23Djemaa, Marianne, 131Do, Hyun-Woo, 12Dobrovoljc, Kaja, 54Doi, Syunya, 66Dojchinovski, Milan, 115, 116, 144
168
Dragoni, Mauro, 10Dras, Mark, 142Draxler, Christoph, 28, 134Drumond, Lucas, 72Druskat, Stephan, 132, 155Du, Jinhua, 1, 77Dubuisson Duplessis, Guillaume, 95Duclot, William, 68Dufour, Barbara, 118Duma, Daniel, 60Dumitrescu, Stefan Daniel, 87Dumont, Corentin, 160Dupont, Stéphane, 76Dutoit, Thierry, 76Dyer, Chris, 115Dyvik, Helge, 123
EEckart de Castilho, Richard, 29Eckart, Thomas, 97Ecker, Brian, 154Ecker, Stefan, 59Eckert, Kai, 12Eckle-Kohler, Judith, 74Edlund, Jens, 156Efthimiou, Eleni, 120Eger, Steffen, 53Ehrmann, Maud, 19, 116Eibl, Maximilian, 161Eichler, Kathrin, 117Eiselen, Roald, 24, 116Ekbal, Asif, 94, 130Ekenel, Hazim, 49El Ballouli, Rim, 152El Haddad, Kevin, 76El-Beltagy, Samhaa R., 101El-Haj, Mahmoud, 9, 63, 91El-Hajj, Wassim, 152ELbassouni, Shady, 152Elhadad, Michael, 150Elingui, Uriel Pascal, 133Ellendorff, Tilia, 129Elliott, Desmond, 48, 106Emerson, Guy, 43Emmery, Chris, 143Engelmann, Kai Frederic, 50, 120Enström, Ingegerd, 7Erdmann, Johnsey, 110Eriksson, Robin, 62
Erjavec, Tomaž, 53, 125Erro, Daniel, 26Escudero-Mancebo, David, 73Eshkol, Iris, 131Eshkol-Taravela, Iris, 130Eskander, Ramy, 45, 140Eskenazi, Maxine, 135España-Bonet, Cristina, 102Espinosa Anke, Luis, 80, 115Espinoza, Fredrik, 12Esplà-Gomis, Miquel, 103Estève, Yannick, 11, 36Etchegoyhen, Thierry, 122Etcheverry, Mathias, 127Etxeberria, Izaskun, 37Euzenat, Jérôme, 84Eyssel, Friederike, 120
FFaessler, Erik, 87Fairon, Cédrick, 8Falala, Sylvain, 118Falk, Ingrid, 42, 83Fandrych, Christian, 10Fang, Alex, 110Farah, Benamara, 95Farajian, M. Amin, 122Faralli, Stefano, 12Farzand, Omer, 28Fatema, Kaniz, 56Fäth, Christian, 154Fauth, Camille, 46Favre, Benoit, 11, 15, 153Fawei, Biralatei, 117Fazly, Afsaneh, 103Federico, Marcello, 122Feldman, Laurie, 39Fellbaum, Christiane, 39Feltracco, Anna, 74Ferguson, Emily, 73Fernández Barrera, Meritxell, 157Fernandez Rei, Elisa, 68, 134Fernandez, Raquel, 5Ferreira, Eduardo, 127Ferreira, Jaime, 134Ferrero, Jérémy, 144Ferret, Olivier, 67Ferrugento, Adriana, 118Figueira, Anny, 71
169
Finatto, Maria José Bocorny, 92Finch, Andrew, 55Fisas, Beatriz, 107Fischer, Andrea, 145Fischer, Stefan, 124Fišer, Darja, 125Flickinger, Dan, 138Flores-Lucas, Valle, 73Fohr, Dominique, 46, 133Fokkens, Antske, 41, 156Fomicheva, Marina, 96Fonseca, Evandro, 6, 71Forkel, Robert, 83Forsberg, Markus, 90Fort, Karën, 55Foster, Jonathan, 108Foster, Simon, 129Fothergill, Richard, 10Fotinea, Stavroula–Evita, 120Foucault, Nicolas, 105Fougeron, Cecile, 161Fournier, Sebastien, 126Fox Tree, Jean, 110, 120Fox Tree, Jean, 121Frain, Alice, 143Francisco, Virginia, 83Francois, Thomas, 8, 134Francopoulo, Gil, 12, 49, 65Frank, Anette, 105Frankenberg, Claudia, 25Fredouille, Corinne, 69, 161Freitag, Dayne, 72Freitas, André, 72Freitas, Bianca, 152Freitas, Cláudia, 152Freitas, Maria João, 47Frick, Elena, 10, 98Fried, Daniel, 103Frieder, Ophir, 104Frontini, Francesca, 33, 88Füchsel, Silke, 4Fujita, Akira, 96Fulgoni, Dean, 129Funakoshi, Kotaro, 110Fünfer, Sarah, 64Fung, Pascale, 18, 68Funk, Adam, 15, 108Furrer, Lenz, 34
GGábor, Kata, 128Gabryszak, Aleksandra, 84Gagliardi, Gloria, 72Gaizauskas, Robert, 15, 106, 108Galibert, Olivier, 65Galvan, Paloma, 83Gamallo, Pablo, 102Gambäck, Björn, 64Ganguly, Debasis, 65Ganguly, Niloy, 57Ganzeboom, Mario, 28Gao, Jie, 78Garain, Utpal, 89García Mateo, Carmen, 49García-Mateo, Carmen, 68, 134García Pablos, Aitor, 3Garcia, Marcos, 116García-Miguel, José M., 49Garnier, Marie, 7Gaspari, Federico, 157Gast, Volker, 132, 155Gaudio, Rosa, 1Gauthier, Elodie, 133Geoffrois, Edouard, 9Georg, Gersende, 60Georgeton, Laurianne, 161Georgiladakis, Spiros, 42Gerlach, Johanna, 8Ghaddar, Abbas, 5Ghannay, Sahar, 11Ghidoni, Enrico, 72Ghio, Alain, 161Ghoneim, Mahmoud, 124, 126, 146Giannini, Silvia, 13Gibbon, Dafydd, 113Gilmartin, Emer, 154Ginter, Filip, 57, 82Ginzburg, Jonathan, 61Girard-Rivier, Maxence, 52Gkatzia, Dimitra, 75Glaser, Elvira, 140Gleim, Rüdiger, 53Gobert, Maxime, 147Godfrey, John, 46Goeuriot, Lorraine, 67Goggi, Sara, 13Goharian, Nazli, 28, 104
Gokcen, Ajda, 110
Goldberg, Yoav, 57
Gomes, Luís, 78, 97
Gómez Guinovart, Xavier, 93
Gomez, Randy, 76
Gómez-Rodríguez, Carlos, 144
Gonçalo Oliveira, Hugo, 102, 118, 150
Gonçalves, Anabela, 112
González Saavedra, Berta, 90
Gonzàlez, Meritxell, 16
González-Ferreras, César, 73
Goodman, Michael Wayne, 43
Goodwin, Travis, 160
Gorisch, Jan, 111
Gornostaja, Tatjana, 144
Gosko, Didzis, 62
Götze, Jana, 132
Goulas, Theodore, 120
Goutte, Cyril, 62
Goyal, Kartik, 115
Grabar, Natalia, 92, 130
Gracia, Jorge, 31, 84
Graff, David, 147
Graham, Calbert, 74
Gralinski, Filip, 142
Granvogl, Daniel, 70
Green, Phil, 69
Greenwood, Mark A., 128
Grefenstette, Gregory, 47
Griffitt, Kira, 129
Grimes, Stephen, 32, 102
Grishman, Ralph, 20
Grouas, Thibault, 157
Grouin, Cyril, 70, 124, 141
Grover, Claire, 136
Grzitis, Normunds, 89
Guerraz, Aleksandra, 70
Guillou, Erwan, 48
Guillou, Liane, 22
Gulordava, Kristina, 99
Gupta, Palash, 6
Gurevych, Iryna, 29, 32, 74, 105
Gurrutxaga, Antton, 113
Gustafson, Joakim, 156
Gutiérrez-González, Yurena, 73
Gutierrez-Vasques, Ximena, 146
Gutkin, Alexander, 69
H
H. Arai, Noriko, 96
Ha, Linne, 69
Haaf, Susanne, 151
Habash, Nizar, 38, 45, 64, 126, 140, 148, 149, 152
Habernal, Ivan, 32, 74
HaCohen-Kerner, Yaakov, 19
Hagen, Kristin, 51
Hagmüller, Martin, 27
Hahn, Udo, 87, 88
Hahn-Powell, Gus, 6, 11
Hain, Thomas, 61, 69
Hajic, Jan, 55, 57, 105, 138, 149
Hajj, Hazem, 152
Hajnicz, Elzbieta, 47, 91
Hakkani-Tur, Dilek, 26
Halabi, Nawar, 25
Halfaker, Aaron, 45
Hamfors, Ola, 12
Hamon, Thierry, 92
Han, Jingyi, 77
Han, Qi, 42
Hanbury, Allan, 144
Handschuh, Siegfried, 72
Hangya, Viktor, 100
Hanke, Thomas, 115
Hanl, Michael, 124
Hansen, Dorte Haltrup, 87
Hantke, Simone, 46, 75
Harabagiu, Sanda, 160
Harashima, Jun, 74, 85
Hardmeier, Christian, 22
Harige, Ravindra, 84
Hartmann, Silvana, 105
Hasanuzzaman, Mohammed, 130
Hasida, Koiti, 141
Hassan, Sara, 148
Hateva, Neli, 27
Hathout, Nabil, 38, 47
Hätty, Anna, 79
Haugereid, Petter, 123
Hawwari, Abdelati, 124, 126, 146
Hayakawa, Akira, 21
Hayashi, Yoshihiko, 43, 91
Hayoun, Avi, 150
Hazem, Amir, 108, 145
He, Yifan, 20
He, Yulan, 143
Hedberg, Karin, 105
Hedeland, Hanna, 10
Heid, Ulrich, 99
Hellmann, Sebastian, 84, 116, 144
Hellrich, Johannes, 87
Hendrickx, Iris, 1
Hendrikx, Pascal, 118
Hennig, Leonhard, 84, 117
Henriksen, Lina, 16
Hensler, Andrea, 30
Hepple, Mark, 108, 151
Hermann, Thomas, 120
Herms, Robert, 161
Hernaez, Inma, 26
Hernández Farías, Delia Irazú, 101
Hernandez Pompa, Isaac, 146
Hernandez, Nicolas, 61
Hernando, Javier, 49
Hersh, William, 118
Hervas, Raquel, 14, 83
Hicks, Davyth, 113
Higashinaka, Ryuichiro, 110
Hirayama, Naoki, 161
Hládek, Daniel, 66
Hladka, Barbora, 83
Hnátková, Milena, 88
Hoenen, Armin, 73, 148
Hofmann, Hansjörg, 26
Hohle, Petter, 55
Hokamp, Chris, 127
Hollenstein, Nora, 137
Hollink, Laura, 48
Holst, Anders, 12
Holthaus, Patrick, 50, 120
Homburg, Timo, 141
Hongchao, Liu, 159
Hönig, Florian, 46
Horbach, Andrea, 7, 30, 59
Horsmann, Tobias, 148
Horvat, Matic, 15, 43
Hoste, Véronique, 12, 62, 101
Hough, Julian, 5, 61
Hovy, Dirk, 104
Hovy, Eduard, 45
Htait, Amal, 126
Hu, Junfeng, 51
Hu, Zhichao, 120
Hua, Zhenhao, 109
Huang, Chu-Ren, 79, 139, 158, 159
Huang, Hen-Hsen, 36
Huang, Shujian, 23
Huangfu, Luwen, 103
Hubert, Isabell, 112
Huck, Matthias, 1
Huet, Stéphane, 126
Hulden, Mans, 37, 90
Humayoun, Muhammad, 28, 128
Hunter, Julie, 95
Hupkes, Dieuwke, 3
Husic, Halima, 98
Huygen, Paul, 156
I
Ide, Nancy, 16, 87
Idiart, Marco, 80
Ijuin, Koki, 147
Iliakopoulou, Aikaterini, 17
Iliash, Anna, 10
Ilievski, Filip, 19, 151
Illina, Irina, 133
Imada, Takakazu, 66
Imran, Muhammad, 57
Inaba, Michimasa, 110
Indig, Balázs, 84
Inel, Oana, 121, 137
Inoue, Masashi, 95
Inoue, Yusuke, 66
Inui, Kentaro, 160
Ioki, Masayuki, 85
Iosif, Elias, 4, 42, 100, 137
Iribe, Yurie, 162
Irimia, Elena, 87
Isahara, Hitoshi, 77
Isard, Amy, 60
Ishida, Mitsuru, 147
Ishida, Toru, 114, 154
Itoyama, Katsutoshi, 161
Ivanova, Angelina, 138
Izquierdo, Ruben, 59
J
Jabaian, Bassam, 126
Jackl, Bernhard, 134
Jacquet, Guillaume, 19
Jacquey, Evelyne, 151
Jadi, Grégoire, 41
Jaffe, Evan, 110
Jagrova, Klara, 145
Jaimes, Alejandro, 17
Jain, Rohit, 60
Jakubicek, Milos, 97
Janier, Mathilde, 35
Jansche, Martin, 69
Janssen, Maarten, 112, 139
Jaquette, Daniel, 86
Jauch, Ronny, 99
Jazbec, Ivo-Pavao, 148
Jean-Louis, Ludovic, 19
Jelínek, Tomáš, 88
Jeong, Young-Seob, 12
Jettka, Daniel, 10
Jezek, Elisabetta, 74
Jha, Girish, 51
Jha, Rahul, 17
Ji, Donghong, 30
Jiménez, Ricardo-María, 91
Jimeno Yepes, Antonio, 102
Johannessen, Janne M, 51
Johannsen, Anders, 30, 104
Johansson, Richard, 94, 105
Jones, Dewi Bryn, 113
Jones, Gareth, 65
Jones, Karen, 147
Jonquet, Clement, 58
Joo, Won-Tae, 12
Joscelyne, Andrew, 16
Joshi, Aditya, 77
Jouvet, Denis, 46
Jügler, Jeanin, 46
Juhár, Jozef, 66, 162
Juhn, Young, 15
Junczys-Dowmunt, Marcin, 122
Jung, Manuel, 128
Jurgens, David, 7
K
Kındıroglu, Ahmet Alp, 48
Kaalep, Heiki-Jaan, 85
Kabadjov, Mijail, 29
Kabashi, Besim, 149
Kachkovskaia, Tatiana, 67
Kahn, Juliette, 25, 65
Kalamboukis, Theodore, 13
Kameko, Hirotaka, 49
Kaminski, Steve, 86
Kamocki, Pawel, 88, 154
Kampstra, Frederik, 162
Kanayama, Hiroshi, 57
Kanojia, Diptesh, 77, 150
Kaplan, Aidan, 45
Kaplan, Dain, 152
Karabüklü, Serpil, 48
Karkaletsis, Vangelis, 156
Karlgren, Jussi, 12
Kashyap, Laxmi, 106
Katakis, Ioannis Manousos, 156
Katayama, Taichi, 22
Katerenchuk, Denys, 127
Kato, Akihiko, 58
Kato, Tsuneo, 9
Kato, Yoshihide, 54
Katris, Nikolaos, 13
Kattenberg, Mathijs, 156
Kawada, Yasuhide, 66
Kawasaki, Yoshifumi, 3
Keiper, Lena, 7
Kelepir, Meltem, 48
Kelly, Liadh, 67
Kemmerer, Steffen, 39
Kemps-Snijders, Marc, 86
Kennington, Casey, 5
Kepler, Fabio, 50
Kerler, Dov-Ber, 163
Kermanidis, Katia Lida, 1
Kermes, Hannah, 67
Kettnerová, Václava, 18
Kettunen, Kimmo, 33
Khalfi, Mustapha, 33
Khalifa, AlBara, 9
Khalifa, Salam, 38, 148
Khamis, Ashraf, 67
Khan, Fahad, 33, 88
Khan, R. A., 48
Khan, Tafseer Ahmed, 82
Khashabi, Daniel, 141
Khemakhem, Mohamed, 29
Khiari, Wejdene, 41
Khudanpur, Sanjeev, 46
Khvtisavrishvili, Nana, 79
Kieras, Witold, 90
Kijak, Ewa, 128
Kilicoglu, Halil, 115
Kingma, Sigrid, 162
Kiritchenko, Svetlana, 2, 40, 136
Kirov, Christo, 108, 109
Kisler, Thomas, 28, 134
Kiss, Tibor, 98
Kitaoka, Norihide, 162
Klakow, Dietrich, 4, 145
Klang, Marcus, 143
Klassen, Prescott, 119
Klein, Ewan, 60
Klejch, Ondrej, 65
Klenner, Manfred, 100
Kleppe, Martijn, 106
Klessa, Katarzyna, 162
Kliegr, Tomas, 115
Klimek, Bettina, 31, 84
Klinger, Roman, 39
Kloppenburg, Lennart, 98
Klubicka, Filip, 103, 148
Klyueva, Natalia, 100
Knappen, Jörg, 67
Knight, Dawn, 91
Knight, Kevin, 15
Kobayashi, Yuka, 110
Kobourov, Stephen, 103
Koch, Steffen, 42
Kochanowski, Bartłomiej, 99
Kocharov, Daniil, 67
Koctúr, Tomáš, 162
Kohl, Matt, 16
Köhn, Arne, 161
Koidl, Kevin, 144
Koiso, Hanae, 153
Kolcz, Alek, 104
Komachi, Mamoru, 13
Konat, Barbara, 135
Konovalov, Vasily, 109
Köper, Maximilian, 42, 90
Kordjamshidi, Parisa, 141
Kordoni, Valia, 1
Korkontzelos, Yannis, 44, 63
Kornai, Andras, 98
Korpusik, Mandy, 104
Köster, Norman, 120
Koto, Fajri, 28
Kousidis, Spyros, 61
Koutsakis, Polychronis, 100
Koutsombogera, Maria, 120
Kovár, Vojtech, 14
Kováríková, Dominika, 88
Krause, Sebastian, 31, 84, 117
Krause, Thomas, 155
Kraut, Robert, 45
Krejcová, Ema, 29, 30
Kren, Michal, 88, 91
Krenn, Brigitte, 49
Krieg-Holz, Ulrike, 88
Krilavicius, Tomas, 88
Krisch, Jennifer, 99
Krishnaswamy, Nikhil, 159
Kríž, Vincent, 83
Krome, Sabine, 30
Krstev, Cvetana, 18
Kruschwitz, Udo, 29, 56, 71
Ku, Lun-Wei, 94
Kuhlmann, Marco, 138
Kuhn, Jonas, 6, 149
Kuhnle, Alexander, 43
Kulick, Seth, 89
Kummert, Franz, 120
Kunz, Kerstin Anna, 34
Kuo, Chung-Lun, 43
Kupietz, Marc, 124
Kuras, Christoph, 97
Kurfalı, Murathan, 125
Kurfürst, Dennis, 50
Kurohashi, Sadao, 22, 76, 77, 102
Kurtic, Emina, 108
Kutuzov, Andrey, 106
Kuvac Kraljevic, Jelena, 112
Kuzmenko, Elizaveta, 106
Kyaw Thu, Ye, 55
Kyuseva, Maria, 43
L
Laaridh, Imed, 69, 161
Labaka, Gorka, 77
Lachler, Jordan, 112
Lafourcade, Mathieu, 158
Lai, Mirko, 56
Lailler, Carole, 36
Lam, Sam, 133
Lancelot, Renaud, 118
Landeau, Anaïs, 36
Lane, Caoilfhionn, 20
Lanfrey, Damien, 14
Langlais, Phillippe, 5
Lanser, Bettina, 121
Laparra, Egoitz, 51, 93
Laprie, Yves, 46
Lapshinova-Koltunski, Ekaterina, 34
Laur, Sven, 85
Laura, Monceaux, 41
Lawrence, John, 135
Lazic, Biljana, 18
Le, Dieu-Thu, 14
Le, Ha, 53
Lebani, Gianluca, 33
Lecouteux, Benjamin, 48, 68
Lee, Annie, 103
Lee, John, 37, 58
Lefeuvre-Halftermeyer, Anaïs, 131
Lefever, Els, 12, 62
Lefevre, Fabrice, 126
Léger, Serge, 62
Legou, Thierry, 161
Leichsenring, Christian, 120
Lejeune, Gaël, 151
Lenci, Alessandro, 33, 76, 92, 158
Lendvai, Piroska, 159
Leonhard, Matthias, 27
Leser, Ulf, 39
Lesnikova, Tatiana, 84
Letard, Vincent, 95
Levchik, Anatolii, 40
Levin, Lori, 115
Lewis, David, 56
Li, Claire, 133
Li, Junyi Jessy, 135
Li, Minglei, 64, 133
Li, Wenjie, 64
Li, Xuansong, 32, 102
Liakata, Maria, 60, 142
Liao, Wan-Shan, 36
Liberman, Mark, 73
Libovický, Jindrich, 24
Liddy, Elizabeth D., 40
Liebeskind, Chaya, 19
Lien, John, 39
Lier, Florian, 120
Liew, Jasy Suet Yan, 40
Ligozat, Anne-Laure, 8, 80, 95
Lim, Chae-Gyun, 12
Limburská, Adéla, 45
Lin, Donghui, 154
Lison, Pierre, 32
List, Johann-Mattis, 83
Listenmaa, Inari, 24
Littell, Patrick, 115
Little, Alexa, 81
Liu, Hongfang, 15, 118
Liu, Kris, 110, 120, 121
Liu, Lin, 56, 156, 157
Liu, Qun, 77, 95
Liu, Ting, 39, 130
Liu, Wuying, 37
Liu, Yang, 35
Liyanapathirana, Jeevanthi, 78
Ljubešic, Nikola, 53, 103, 112, 125, 148
Llewellyn, Clare, 136
Llozhi, Lorena, 7
Loáiciga, Sharid, 99
Löfberg, Laura, 91
Logacheva, Varvara, 78, 127
Loginova Clouet, Elizaveta, 61
Lojka, Martin, 162
Long, Yunfei, 64
Lopes, Carla, 27
Lopes, José, 4
Lopez de Lacalle, Maddalen, 93
Lopez de Lacalle, Oier, 97
Lopez, Cédric, 39
Losnegaard, Gyri Smørdal, 80, 123
Lossio-Ventura, Juan Antonio, 58
Loudcher, Sabine, 81
Louka, Katerina, 4
Loukachevitch, Natalia, 40
Lovick, Olga, 114
Lowe, John B., 122
Loza Mencía, Eneldo, 160
Lu, Jing, 138
Lu, Qin, 133, 158
Lu, Yanan, 30
Lubis, Nurul, 76
Lucisano, Pietro, 4
Luecking, Andy, 50, 148
Lukin, Stephanie, 36
Lundkvist, Peter, 7
Luo, Wentao, 43
Lupu, Mihai, 144
Lusicky, Vesna, 16
Luz, Saturnino, 21
Lyding, Verena, 86
Lyse, Gunn Inger, 123
M
Maamouri, Mohamed, 32
Machado, Gabriel, 66
Maciejewski, Matthew, 46
Mackaness, William, 75
Maegaard, Bente, 16
Magnani, Romain, 52
Magnini, Bernardo, 74
Magnolini, Simone, 74
Maharjan, Nabin, 42
Mahlow, Cerstin, 100
Maier, Wolfgang, 144
Makrai, Márton, 96
Maks, Isa, 41
Malchanau, Andrei, 109, 110
Malcuori, Marisa, 72
Maldonado, Alfredo, 56
Malmasi, Shervin, 62, 142
Mamede, Nuno, 135
Mamprin, Sara, 59
Manishina, Elena, 126
Mankoff, Robert, 17
Mannens, Erik, 144
Manning, Christopher D., 30, 57, 82
Manuvinakurike, Ramesh, 5
Mapelli, Valérie, 16, 56, 157
Marcello, Norina, 72
Marchi, Erik, 75
Marciniak, Malgorzata, 79
Marcu, Daniel, 15
Marecek, David, 4
Marg, Lena, 1
Margaretha, Eliza, 124
Mariani, Joseph, 12, 49, 65
Marti, Roland, 145
Martin, Fabienne, 42
Martin, James H., 122
Martínez Alonso, Héctor, 30
Martinez Calvo, Adela, 68
Martinez Garcia, Eva, 102
Martínez Martínez, José Manuel, 78
Martinez, Marta, 68, 134
Martínez-Hinarejos, Carlos-D., 106
Martinez-Romo, Juan, 35
Martins de Matos, David, 134
Marton, Yuval, 23
Massimo, Poesio, 56
Matamala, Anna, 106
Matos, Miguel, 138
Matsubara, Shigeki, 54
Matsumoto, Yuji, 57, 58
Matsuo, Yoshihiro, 22
Matsuzaki, Takuya, 96
Matthies, Franz, 87, 88
Maurel, Denis, 131
Mauri, Marcel, 50
Maxwell, Mike, 157
May, Jonathan, 15
Maynard, Diana, 40, 128
Mazo, Hélène, 16
Mazura, Margaretha, 16
McCrae, John Philip, 84
McDonald, Ryan, 57
Medved’, Marek, 97
Megyesi, Beata, 111
Mehdad, Yashar, 107
Mehler, Alexander, 50, 53, 148
Meißner, Cordula, 10
Meinel, Christoph, 23
Melamud, Oren, 109
Melero, Maite, 31
Melese, Michael, 133
Mella, Odile, 46
Melo, Luis Felipe, 151
Mendes, Amália, 112
Mendes, Pablo, 151
Mendez, Gonzalo, 83
Menini, Stefano, 15
Metaxas, Dimitris, 107
Meunier, Christine, 69, 161
Meurant, Laurence, 147
Meurer, Paul, 123
Meurers, Detmar, 136
Meurs, Marie-Jean, 19
Meusel, Robert, 12
Meyer zu Borgsen, Sebastian, 120
Michelfeit, Jan, 97
Mihalcea, Rada, 129
Miháltz, Márton, 84
Mihov, Stoyan, 27
Mikulová, Marie, 6
Milicevic, Maja, 159
Miller, Tristan, 29
Milosavljevic, Milan, 93
Minard, Anne-Lyse, 51, 152
Minker, Wolfgang, 3, 63
Mírovský, Jirí, 6, 61
Mirzaei, Azadeh, 132
Mirzaei, Mehrdad, 130
Misra Sharma, Dipti, 6
Mitankin, Petar, 27
Mitkov, Ruslan, 10, 17
Mitra, Prasenjit, 57
Miwa, Makoto, 44
Miyao, Yusuke, 53, 57, 132, 138
Möbius, Bernd, 46
Mociariková, Monika, 14
Modi, Ashutosh, 121
Moe, Lwin, 155
Moens, Marie-Francine, 49
Mohammad, Saif, 2, 40, 136
Mohit, Behrang, 64
Mohler, Michael, 146
Moisik, Scott, 68
Mojica de la Vega, Luis Gerardo, 152
Mokaddem, Sidahmed, 2
Moloodi, Amirsaeid, 132
Monachini, Monica, 33, 88
Moniz, Helena, 4, 134
Montcheuil, Grégoire, 81
Montemagni, Simonetta, 4, 114
Monti, Johanna, 80
Moore, Andrew, 63
Moran, Steven, 84, 153
Morante, Roser, 41
Moreira, André, 86
Morency, Louis-Philippe, 17
Moretti, Giovanni, 14
Morey, Mathieu, 95
Morgado da Costa, Luís, 150
Mori, Hiroki, 139
Mori, Shinsuke, 23, 47, 49, 57, 161
Morin, Emmanuel, 145
Morlane-Hondère, François, 70
Morros, Ramon, 49
Mortensen, David R., 115
Mostafa, Naziba, 68
Mota, Cristina, 44
Motlani, Raveesh, 90
Mott, Justin, 129
Mrabet, Yassine, 115
Mubarak, Hamdy, 12, 37
Mudraya, Olga, 91
Muischnek, Kadri, 54
Mujadia, Vandan, 6
Mújdricza-Maydt, Éva, 105
Müller, Markus, 64
Muller, Philippe, 131
Münch, Stefanie, 161
Murakami, Yohei, 114
Murata, Kenta, 85
Murawaki, Yugo, 47
Muszynska, Ewa, 43
Müürisep, Kaili, 54
Muzaffar, Sharmin, 51
Mykowiecka, Agnieszka, 79
N
Nabi, Hakim, 70
Nagaoka, Atsushi, 139
Nagata, Ryo, 3
Nahli, Ouafae, 33
Nakadai, Kazuhiro, 76
Nakaguchi, Takao, 154
Nakamura, Keisuke, 76
Nakamura, Satoshi, 69, 76
Nakazawa, Toshiaki, 76, 77
Nam, Jinseok, 160
Namer, Fiammetta, 38
Naskar, Debashis, 2
Naskar, Sudip Kumar, 21
Näsman, Jesper, 111
Nasr, Alexis, 79
Nasution, Arbi Haza, 114
Navarretta, Costanza, 17
Navarro, Borja, 151
Navas, Eva, 26
Navigli, Roberto, 59
Nawab, Rao Muhammad Adeel, 28, 91
Nayak, Tapas, 21
Nazar, Rogelio, 52
Neale, Steven, 97, 105
Nedoluzhko, Anna, 6, 34
Neergaard, Karl, 139, 159
Neff, Michael, 120, 121
Neidle, Carol, 107
Nemeskey, Dávid Márk, 98
Nenkova, Ani, 135
Neubig, Graham, 69
Neudecker, Clemens, 150
Neumann, Stella, 125
Névéol, Aurélie, 102
Neves, Mariana, 102
Ng, Vincent, 138, 152
Ng, Vincent T.Y., 133
Nguyen, Kiem-Hieu, 67
Nguyen, Ngan, 53
Nguyen, Ngoc, 154
Nguyen, Quy, 53
Ní Chasaide, Ailbhe, 119
Ní Chiaráin, Neasa, 119
Nicolao, Mauro, 69
Nie, Tian, 66
Niekrasz, John, 72
Niemietz, Paula, 125
Nikolic, Boško, 93
Nimb, Sanni, 30
Niraula, Nobal Bikram, 42
Nisioi, Sergiu, 117, 118, 145
Nissim, Malvina, 98
Niton, Bartłomiej, 47
Nivre, Joakim, 54, 57, 82
Nixon, Lyndon J.B., 116
Noferesti, Samira, 94
Nordhoff, Sebastian, 114, 140
Nöth, Elmar, 46
Nouri, Javad, 108
Nouvel, Damien, 116
Novák, Attila, 45
Novák, Michal, 6
Nugues, Pierre, 143
O
Ó Droighneáin, Eoin, 20
O’Brien, Sharon, 11
O’Daniel, Bridget, 135
O’Regan, Jim, 154
Obeid, Ossama, 64, 126
Oberlander, Jon, 136
Obradovic, Ivan, 18
Odijk, Jan, 86
Oellrich, Anika, 142
Oepen, Stephan, 138
Offersgaard, Lene, 87
Oflazer, Kemal, 64, 126
Ohta, Tomoko, 132
Ohya, Kazushi, 113
Okanoya, Kazuo, 75
Okuno, Hiroshi G., 161
Okur, Eda, 19
Olsen, Sussi, 16, 30
Olsson, Fredrik, 12
Onaindia, Eva, 2
Onambele, Christophe, 90
Oostdijk, Nelleke, 35
Oramas, Sergio, 115
Orasmaa, Siim, 85
Oravecz, Csaba, 45
Orizu, Udochukwu, 143
Ortiz Rojas, Sergio, 103
Osella, Michele, 144
Osenova, Petya, 80, 84, 105
Ostermann, Simon, 121
Otegi, Arantxa, 105
Otrusina, Lubomir, 114
Outahajala, Mohamed, 149
Øvrelid, Lilja, 55
Özates, Saziye Betül, 99
Özbal, Gözde, 131
Özgür, Arzucan, 19, 99, 104
Özsoy, Ayse Sumru, 48
Ozturel, Adnan, 61
P
Pa Pa, Win, 55
Pääkkönen, Tuula, 33
Paetzold, Gustavo, 107
Paikens, Pteris, 89
Pajkossy, Katalin, 148
Pal, Santanu, 21
Palmér, Anne, 111
Palmer, Martha, 32, 82, 137
Palmero Aprosio, Alessio, 31
Palogiannidi, Elisavet, 4, 100
Palotti, Joao, 67
Pan, Jeff, 117
Panchenko, Alexander, 92
Papavassiliou, Vassilis, 32
Paperno, Denis, 43
Pappu, Aasish, 17
Paramita, Monica, 108
Pardelli, Gabriella, 13, 88
Pareja-Lora, Antonio, 84
Pareti, Silvia, 61, 135
Parish-Morris, Julia, 73
Park, Joonsuk, 135
Park, SoHyun, 103
Parker, Jonathan, 131
Paroubek, Patrick, 12, 65
Parra Escartín, Carla, 80
Parvizi, Artemis, 16
Pasha, Arfath, 140
Passaro, Lucia C., 76
Passarotti, Marco, 24, 90
Patti, Viviana, 56, 101
Paulheim, Heiko, 12, 151
Pawar, Dipawesh, 130
Pedersen, Bolette, 30
Pedersen, Ted, 118
Peldszus, Andreas, 36
Pelemans, Joris, 133
Pelletier, Francis Jeffry, 98
Perdigão, Fernando, 27
Pereira Lopes, Gabriel, 78
Pereira, Rita, 105
Pérez, Naiara, 122
Perret, Jérémy, 36
Pershina, Maria, 20
Persson, Per, 12
Pessentheiner, Hannes, 27
Petasis, Georgios, 156
Peters, Wim, 13
Petkevic, Vladimír, 88
Petmanson, Timo, 85
Petrov, Slav, 57
Petukhova, Volha, 4, 109, 110
Piao, Scott, 91
Pichler, Thomas, 27
Pietquin, Olivier, 145
Pilán, Ildikó, 7, 8
Pillot-Loiseau, Claire, 161
Pincus, Eli, 95
Pinkal, Manfred, 30, 121
Pinnis, Marcis, 27
Pipatsrisawat, Knot, 69
Piper, Andrew, 7
Piperidis, Stelios, 32
Plancq, Clément, 66
Plank, Barbara, 56
Plu, Julien, 19, 151
Podlaska, Katarzyna, 83
Poesio, Massimo, 29, 71
Pohling, Marian, 120
Poibeau, Thierry, 66
Poignant, Johann, 11, 49
Poláková, Lucie, 61
Poletto, Cecilia, 155
Polzehl, Tim, 74
Ponti, Edoardo Maria, 24
Ponzetto, Simone Paolo, 12
Pool, Jonathan, 84
Popel, Martin, 65, 105
Popescu, Octavian, 117
Popescu, Vladimir, 16, 56, 156, 157
Popescu-Belis, Andrei, 78
Popovic, Maja, 2, 65
Poppek, Johanna Marie, 98
Pörner, Nina, 28, 134
Portet, François, 48, 68
Postma, Marten, 59
Potamianos, Alexandros, 4, 42, 100, 137
Pouchoulin, Gilles, 161
Pouli, Vasiliki, 20
Pouliquen, Bruno, 122
Povlsen, Claus, 16
Prabhakaran, Vinodkumar, 70
Prange, Jakob, 30
Preotiuc-Pietro, Daniel, 129, 151
Pretkalnina, Lauma, 89
Prévot, Laurent, 33, 54, 111
Procházka, Pavel, 88
Proença, Jorge, 27
Proisl, Thomas, 149
Prokopidis, Prokopis, 32
Prys, Delyth, 113
Prys, Gruffudd, 113
Puolakainen, Tiina, 54
Pustejovsky, James, 16, 87, 159
Pyysalo, Sampo, 57, 132
Q
QasemiZadeh, Behrang, 64
Qin, Lu, 64
Qiu, Zhengwei, 34
Quasthoff, Uwe, 14, 97
Que, Roger, 109
Quénot, Georges, 49
Querido, Andreia, 21, 54, 96
Quilitzsch, Anya, 163
Quochi, Valeria, 113
R
Rabadan, Adrian, 14
Rabinovich, Ella, 145
Rademaker, Alexandre, 31
Radev, Dragomir, 17, 99, 107
Raganato, Alessandro, 59
Ramadier, Lionel, 158
Rambelli, Giulia, 33
Rambow, Owen, 45, 70, 140
Ramisch, Carlos, 42, 79
Ramsay, Allan, 52
Ramshaw, Lance, 32
Rauschenberger, Maria, 4
Rauzy, Stéphane, 54, 81
Ravenscroft, James, 60, 142
Ray, Jessica, 11
Rayner, Manny, 8
Rayson, Paul, 9, 63, 91
Read, Jonathon, 60
Real, Livy, 31
Rebollo, Miguel, 2
Recski, Gábor, 91, 98
Reddy, Dinesh, 115
Redling, Benjamin, 88
Reed, Chris, 35, 135
Regueira, Xose Luis, 134
Rehbein, Ines, 36
Rehm, Georg, 55, 85
Reichel, Uwe, 28, 134
Reichel, Uwe D., 26
Rekabsaz, Navid, 144
Rello, Luz, 4, 33
Remus, Steffen, 125
Renals, Steve, 62
Renau, Irene, 52
Rendeiro, Nuno, 21, 96
Renner-Westermann, Heike, 154
Rey-Villamizar, Nicolas, 118
Reynaert, Martin, 34, 44
Ribeiro, Eugénio, 134
Ribeiro, Ricardo, 134
Ribes-Lafoz, María, 151
Ribeyre, Corentin, 123
Riccardi, Giuseppe, 5, 99, 153
Richardson, John, 49
Richart, Cécile, 39
Richter, Viktor, 120
Rieser, Verena, 75
Rigau, German, 3, 51, 59, 93
Rikters, Matiss, 21
Rinaldi, Fabio, 129
Rink, Bryan, 146
Rinke, Esther, 155
Ritchie, Phil, 144
Rituma, Laura, 89
Rizzo, Giuseppe, 19, 151
Roberts, Kirk, 115, 130
Roche, Mathieu, 41, 58, 118
Rodrigues, Filipe, 118
Rodríguez, Alejandro, 78
Rodríguez, Eric, 78
Rodriguez, Kepa, 71
Rodriguez, Laritza, 115
Rodríguez-Fernández, Sara, 80
Rodriguez-Ferreira, Teresa, 14
Roesiger, Ina, 6, 60
Roesner, Immer, 27
Rohwer, Richard, 72
Romary, Laurent, 66
Ronzano, Francesco, 107, 137
Rosá, Aiala, 72
Rosén, Victoria, 80, 123
Rosenberg, Andrew, 127
Rospocher, Marco, 31, 51
Rossato, Solange, 25, 48
Rosset, Sophie, 49, 65, 80, 95, 110, 116
Rossini Favretti, Rema, 72
Rosso, Paolo, 149
Roth, Dan, 141
Roux, Justus, 85
Roziewski, Szymon, 97
Rozis, Roberts, 44
Ruan, Chong, 51
Rubens, Neil, 152
Rudnicka, Ewa, 83
Rudnicky, Alexander, 109
Rudra, Koustav, 57
Ruiz, Pablo, 66
Ruppenhofer, Josef, 100
Rus, Vasile, 42, 130
Russell, Martin, 8
Russo, Irene, 88, 113
Ruths, Derek, 7
Rychlik, Piotr, 79
Rychlý, Pavel, 14
Ryzhova, Daria, 43
Rzymski, Christoph, 132
S
S, Sreelekha, 22
Sabetghadam, Serwah, 144
Sack, Harald, 115
Sadamitsu, Kugatsu, 22
Sadeque, Farig, 118
Saerens, Marco, 134
Saggion, Horacio, 107, 115, 137
Saha, Shyamasree, 142
Sahlgren, Magnus, 12
Saidi, Arash, 51
Saint-Dizier, Patrick, 7, 34, 46
Saito, Itsumi, 22
Sajous, Franck, 47
Sakaki, Shigeyuki, 104
Sakti, Sakriani, 76
Salameh, Mohammad, 2
Salchak, Aelita, 89
Salden, Uta, 115
Salesky, Elizabeth, 11
Salim, Soufian, 61
Salimbajevs, Askars, 27
Salliau, Frank, 144
Salloum, Wael, 140
Salvetti, Franco, 122
Samardzic, Tanja, 140, 159
Samier, Quentin, 157
Samih, Younes, 144
Sammons, Mark, 141
San Vicente, Iñaki, 33, 102
Sánchez, Noelia, 151
Sandell, Monica, 7
Sanders, Eric, 103, 111, 113
Sangati, Federico, 80, 98
Sänger, Mario, 39
Santos, Ana Lúcia, 47
Santos, Diana, 152
Santos, Eddie Antonio, 112
Santos, Fábio, 150
Santus, Enrico, 158, 159
Saralegi, Xabier, 14, 33
Sarasola, Kepa, 77
Sarasola, Xabier, 26
Saraswati, Jaya, 106
Saratxaga, Ibon, 26
Sarhimaa, Anneli, 113
Sasa, Yuko, 52
Sasada, Tetsuro, 23, 49
Sasaki, Felix, 144
Sassolini, Eva, 114
Saulite, Baiba, 89
Saurí, Roser, 16
Savary, Agata, 78, 80, 131
Scarton, Carolina, 126
Schäfer, Roland, 155
Schang, Emmanuel, 131
Scharl, Arno, 116
Scheffler, Tatjana, 35
Schenner, Mathias, 140
Scherer, Stefan, 17
Scherrer, Yves, 140
Schiel, Florian, 28, 134
Schiffhauer, Birte, 120
Schlangen, David, 5, 61, 120
Schlechtweg, Dominik, 5
Schleicher, Thomas, 63
Schmidt, Maria, 26
Schmidt, Thomas, 10, 52
Schmidt-Thieme, Lars, 72
Schmitt, Alexander, 3
Schneider, Nathan, 137
Schneider-Stickler, Berit, 27
Schoen, Anneleen, 152
Scholman, Merel, 36
Scholze-Stubenrecht, Werner, 30
Schöne, Karin, 86
Schreitter, Stephanie, 49
Schröder, Johannes, 25
Schuller, Björn, 46, 75
Schulte im Walde, Sabine, 42, 79, 90
Schultz, Robert T., 73
Schultz, Tanja, 25
Schulz, Sarah, 149
Schulz, Simon, 120
Schumann, Anne-Kathrin, 64, 124
Schuschnig, Christian, 88
Schuster, Sebastian, 82
Schuurman, Ineke, 23
Schwab, Didier, 144
Seara, Roberto, 134
Seddah, Djamé, 82, 123
Sedlák, Michal, 88
Seelig, Laura, 161
Segawa, Shuhei, 162
Segers, Roxane, 51
Segond, Frederique, 39
Seibel, Brandon, 103
Seitner, Julian, 12
Sekulic, Ivan, 93
Semenkin, Eugene, 3
Sepesy Maucec, Mirjam, 162
Seraji, Mojgan, 82
Sergienko, Roman, 63
Serra, Xavier, 115
Serralheiro, António, 138
Servan, Christophe, 145
Sevcikova, Magda, 45
Shaban, Khaled, 152
Shafi, Jawad, 91
Shah, Kashif, 145
Shahrour, Anas, 149
Shaikh, Samira, 39, 130
Shamsfard, Mehrnoush, 94
Shan, Muhammad, 63
Sharjeel, Muhammad, 63
Sharma, Dipti, 60, 82, 90, 159
Sharma, Himanshu, 60
Sharoff, Serge, 127
Sheikh, Imran, 133
Shen, Wade, 11
Sheridan, Páraic, 34
Shi, Huaxing, 101
Shindo, Hiroyuki, 58
Shiue, Yow-Ting, 8
Shooshan, Sonya, 115
Shrestha, Niraj, 49
Shrestha, Prasha, 118
Shukla, Rajita, 106
Sidarenka, Uladzimir, 39
Sidorov, Maxim, 3
Sierra, Gerardo, 146
Siklósi, Borbála, 45
Silva, Guilherme, 86
Silva, João, 54, 105
Silveira, Natalia, 57
Sim Smith, Karin, 142
Simi, Maria, 58
Simkó, Katalin Ilona, 100
Simões, Alberto, 93
Simonyi, András, 84
Simov, Kiril, 105
Simunic, Roman Nino, 98
Singh, Dhirendra, 81, 106
Sitaram, Sunayana, 119
Skadina, Inguna, 21
Skadinš, Raivis, 44
Skoumalová, Hana, 88
Škrabal, Michal, 88
Skrelin, Pavel, 67
Smith, Daniel, 90
Smrz, Pavel, 114
Šnajder, Jan, 58, 93, 105
Sobhani, Parinaz, 136
Søgaard, Anders, 30
Sohn, Sunghwan, 15
Solda Kutzmann, Donatella, 14
Soler, Juan, 44
Solorio, Thamar, 118
Sommerdijk, Bridget, 103
Song, Zhiyi, 129
Sordo, Mohamed, 115
Sørensen, Nicolai Hartvig, 30
Soria, Claudia, 88, 113
Soriano Morales, Edmundo Pavel, 81
Soroa, Aitor, 140, 156
Sosoni, Vilelmini, 1
Specia, Lucia, 78, 107, 126, 127, 142
Spektors, Andrejs, 89
Speranza, Manuela, 152
Sperber, Matthias, 69
Spitkovsky, Valentin I., 30
Sproat, Richard, 69
Sprugnoli, Rachele, 14, 15, 121
Srijith, P. K., 151
Srikumar, Vivek, 141
Štajner, Sanja, 21, 96
Stankovic, Ranka, 18
Staš, Ján, 66, 162
Stede, Manfred, 35, 36, 59
Steen, Julius, 82
Štefanec, Vanja, 112
Stefanov, Kalin, 153
Stefas, Mickael, 11, 49
Steffen, Diana, 30
Stegen, Florian, 161
Stein, Achim, 24, 83
Steinberger, Josef, 29
Steinberger, Ralf, 19
Steiner, Petra, 38
Stenger, Irina, 145
Stent, Amanda, 17, 107
Štepánek, Jan, 61
Stepanov, Evgeny, 5, 153
Stevens, Christopher, 109
Stoitsis, Giannis, 144
Stokowiec, Wojciech, 97
Straka, Milan, 45, 149
Straková, Jana, 149
Stranák, Pavel, 88, 100
Stranisci, Marco, 101
Strapparava, Carlo, 131
Strassel, Stephanie, 32, 102, 114, 129, 147, 157
Strik Lievers, Francesca, 79
Strik, Helmer, 8, 28
Strötgen, Jannik, 128
Strzalkowski, Tomek, 39, 130
Stüker, Sebastian, 64
Su, Keh-Yih, 101
Suderman, Keith, 16, 87
Sukhareva, Maria, 51, 74
Sulea, Octavia-Maria, 117
Sumita, Eiichiro, 55, 77
Sun, Ming, 109
Sundberg, Gunlög, 7
Surdeanu, Mihai, 6, 11, 103
Sutcliffe, Richard, 13
Suzuki, Kanta, 54
Sylak-Glassman, John, 108, 109
Szabó, Martina Katalin, 100
T
Taatgen, Niels, 109
Tachibana, Ryuichi, 13
Tack, Anaïs, 8
Tadic, Marko, 147
Takahashi, Fumihiko, 161
Takamura, Hiroya, 3
Takeuchi, Moe, 147
Tambouratzis, George, 20
Tamburini, Fabio, 41, 72
Tamchyna, Aleš, 123
Tamisier, Thomas, 11, 49
Tamres-Rudnicky, Yulian, 109
Tanaka, Takaaki, 57
Tanev, Hristo, 40
Tannier, Xavier, 39, 67
Tateisi, Yuka, 132
Tavarez, David, 26
Teh, Phoey Lee, 91
Teich, Elke, 67
Teisseire, Maguelonne, 58
Tekiroglu, Serra Sinem, 131
Telaar, Dominic, 25
Tellier, Isabelle, 20, 128
Temnikova, Irina, 10, 17, 97, 126
Teng, Zhiyang, 8
Teraoka, Takehiro, 160
Terbeh, Naim, 73
Tetreault, Joel, 17
Tettamanzi, Andrea, 10
Teufel, Simone, 152
Thadani, Kapil, 107
Thater, Stefan, 7, 30, 59, 121
Thomas, Beverley, 44
Thomaschewski, Jörg, 4
Thompson, Paul, 63
Thunes, Martha, 123
Tian, Ran, 160
Tian, Tian, 20
Tian, Ye, 61
Tiedemann, Jörg, 32, 122
Tim, Oates, 93
Timmermans, Benjamin, 74
Timmons, Tamara, 118
Tjong Kim Sang, Erik, 44
Tkachenko, Alexander, 85
Tobin, Richard, 136
Todo, Naoya, 96
Tokunaga, Takenobu, 152
Tolins, Jackson, 120, 121
Tomlinson, Marc, 146
Tonelli, Sara, 14, 31
Toral, Antonio, 102, 103, 157
Toussaint, Yannick, 151
Toutanova, Kristina, 23
Tracey, Jennifer, 102, 114, 157
Trancoso, Isabel, 134
Tratz, Stephen, 81
Traum, David, 5, 95
Trilsbeek, Paul, 86
Trippel, Thorsten, 86
Trips, Carola, 38
Trmal, Jan, 46
Troncy, Raphael, 19
Trouvain, Juergen, 46
Trtovac, Aleksandra, 18
Trunecek, Petr, 88
Tsarfaty, Reut, 57
Tsuchiya, Tomoyuki, 153
Tsuruoka, Yoshimasa, 49
Tsvetanova, Liliya, 52
Tu, Zhaopeng, 95
Tufis, Dan, 87
Tulkens, Stephan, 143
Tuomisto, Matti, 113
Turtle, Howard R., 40
Tuttle, Siri, 114
Tyers, Francis, 89, 90
U
Uchimoto, Kiyotaka, 77
Uematsu, Sumire, 57
Ueno, Hiroshi, 95
Umata, Ichiro, 147
Ungar, Lyle, 129
Unger, Christina, 121
Uresova, Zdenka, 83, 138
Uria, Larraitz, 37
Urizar, Ruben, 152
Uro, Jim, 70
Uryupina, Olga, 71
Ushiku, Atsushi, 23, 49
Uszkoreit, Hans, 84, 117
Utiyama, Masao, 55, 77
Utka, Andrius, 88
Utsuro, Takehito, 66
Uva, Antonio, 15
Uzair, Muhammad, 28
V
Vacher, Michel, 48, 68
Vaidya, Ashwini, 82
Vala, Hardik, 7
Valadas Pereira, Rita, 54
Valderrama, Jorge, 29
Valenzuela-Escárcega, Marco A., 6, 11
Vallet, Félicien, 70
Valli, André, 79
Vallmitjana, Jordi, 17
Valmaseda, Carlos, 35
Van de Velde, Hans, 162
van den Bosch, Antal, 1, 44, 103
van den Heuvel, Henk, 35, 113, 162
van der Goot, Rob, 23
Van der Kuip, Frits, 162
van der Sijs, Nicoline, 44, 113
Van der Veen, Bas, 86
van Erp, Marieke, 19, 151, 152
Van Eynde, Frank, 23
van Genabith, Josef, 21, 55
Van hamme, Hugo, 133
van Harmelen, Martin, 48
Van Hee, Cynthia, 62
van Hout, Roeland, 111
Van Huyssteen, Gerhard, 23
van Leeuwen, David, 162
van Miltenburg, Emiel, 74
Van Niekerk, Daniel, 23
van Son, Chantal, 41, 152
van Stipriaan, René, 44
Vanallemeersch, Tom, 123
Vandeghinste, Vincent, 23, 123
Vanin, Aline, 6
Varela, Rocio, 68, 134
Varga, Viktor, 100
Vasilaki, Kyriaki, 120
Vasiljevs, Andrejs, 44, 55
Väyrynen, Jaakko, 19
Vela, Mihaela, 21, 78
Velldal, Erik, 60
Vempala, Alakananda, 132
Venturi, Giulia, 4
Verdonik, Darinka, 162
Verhagen, Marc, 16, 87
Verhoeven, Ben, 56
Vernerová, Anna, 29, 30
Versley, Yannick, 82
Verstoep, Kees, 156
Verwimp, Lyan, 133
Vetulani, Grazyna, 99
Vetulani, Zygmunt, 99
Vidra, Jonáš, 45
Vieira, Renata, 6, 66, 71
Vieu, Laure, 131
Vilares, David, 144
Villata, Serena, 43
Villavicencio, Aline, 42, 80, 92, 127
Villegas, Marta, 31
Villemonte de la Clergerie, Eric, 123
Vincze, Veronika, 100
Virone, Daniela, 56
Viswanathan, Akshay, 12
Viszlay, Peter, 162
Vitkut-Adžgauskien, Daiva, 88
Vitvar, Tomas, 115
Vo, Ngoc Phuoc An, 117
Vogel, Stephan, 126
Voisin, Sylvie, 133
Volk, Martin, 34
Volodina, Elena, 7, 8
Volskaya, Nina, 67
Von Reihn, Daniel, 86
Vondricka, Pavel, 88
vor der Brück, Tim, 53
Vossen, Piek, 41, 51, 59
Vulcu, Gabriela, 15
W
Wachsmuth, Sven, 120
Wacker, Philippe, 16
Wagner, Agnieszka, 162
Wagner, Petra, 120
Wagner, Sven, 115
Waibel, Alex, 64, 69
Waitelonis, Joerg, 151
Wald, Mike, 25
Walker, Kevin, 147
Walker, Marilyn, 36, 110, 120, 121, 154
Walker, Martin, 63
Wallner, Franziska, 10
Walshe, Brian, 56
Walther, Désirée, 50
Wambacq, Patrick, 133
Wan, Yan, 68
Wang, Cheng, 23
Wang, Josiah, 106
Wang, Lin, 37
Wang, Longyue, 95
Wang, Meikun, 118
Wang, Shih-Ming, 94
Wang, Yingying, 121
Wanner, Leo, 44, 71, 80
Wanzare, Lilian D. A., 121
Wartena, Christian, 72
Washington, Jonathan, 89
Watanabe, Ryoko, 153
Wawer, Aleksander, 101
Way, Andy, 1, 34, 77, 95
Webber, Bonnie, 137
Weichselbraun, Albert, 116
Weigert, Kathrin, 10
Weiner, Jochen, 25
Wellner, Christian, 30
Wendelstein, Britta, 25
Werner, Steffen, 26
Westpfahl, Swantje, 10, 52
White, Michael, 110
Wi, Chung-Il, 15
Wieling, Martijn, 114
Wierzchon, Piotr, 142
Wijnhoven, Kars, 110
Wilkens, Rodrigo, 80, 127
Wilkinson, Bryan, 93
Windhouwer, Menzo, 86
Wintner, Shuly, 145
Wisniewski, Guillaume, 53
Witkowski, Wojciech, 83
Witt, Andreas, 98, 124
Wolff, Christian, 70
Wolinski, Marcin, 90
Wong, Tak-sum, 58
Wong, Timothy, 133
Wonsever, Dina, 72, 127
Wörtwein, Torsten, 17
Wottawa, Jane, 112
Wrede, Britta, 50, 120
Wrede, Sebastian, 50, 120
Wright, Jonathan, 147
Wu, Stephen, 15, 118
Wu, Xiaofeng, 77
Wu, Yi, 135
Wubben, Sander, 143
Wyner, Adam, 13, 117
X
Xia, Fei, 119
Xiao, Liumingjing, 51
Xiong, Dan, 133
Xu, Feiyu, 84, 117
Xu, Hongzhi, 139
Xu, Yong, 22
Xue, Nianwen, 32
Y
Yaguchi, Manabu, 77
Yahya, Emad, 152
Yamada, Masaru, 139
Yamamoto, Seiichi, 9, 147
Yaneva, Victoria, 10, 17
Yang, An, 51
Yang, Diyi, 45
Yang, Haojin, 23
Yang, Jie, 8
Yang, Yating, 35
Yangarber, Roman, 108
Yanovich, Polina, 107
Yarowsky, David, 108, 109
Yates, Amy, 118
Yates, Andrew, 104
Yeh, Eric, 72
Yetisgen, Meliha, 119
Yeung, Chak Yan, 37
Yilmaz, Emre, 28, 162
Yokomori, Daisuke, 153
Yoshino, Koichiro, 76, 161
Young, Steve, 63
Yu, Hwanjo, 128
Yu, Roy Shing, 133
Yu, Zhiwei, 4
Yuan, Yu, 127
Yvon, François, 22, 53
Z
Žabokrtský, Zdenek, 4, 45
Zaghouani, Wajdi, 64, 126
Zaiß, Melanie, 42
Zampieri, Marcos, 21, 62, 142
Zaragoza, Hugo, 29
Zarcone, Alessandra, 121
Zargayouna, Haifa, 128
Zarghili, Arsalan, 33
Zarrieß, Sina, 5
Zasina, Adrian Jan, 88
Zayed, Omnia, 32
Zeman, Daniel, 4, 57, 138
Zesch, Torsten, 148
Zeyrek, Deniz, 125
Zgank, Andrej, 162
Zhang, Jiajun, 35
Zhang, Junhao, 51
Zhang, Meishan, 8
Zhang, Wanru, 104
Zhang, Xiaojun, 95
Zhang, Yue, 8, 23, 30, 46
Zhang, Ziqi, 78
Zhao, Chen, 66
Zhao, Tiejun, 101
Zhao, Wenli, 135
Zhou, Hao, 23
Zhou, Xi, 35
Zhu, Xiaodan, 136
Zi, Wenjie, 103
Ziai, Ramon, 136
Ziemski, Michał, 122
Zilio, Leonardo, 92, 127
Zimmerer, Frank, 46
Zinn, Claus, 86
Zipser, Florian, 155
Zong, Chengqing, 35
Zorn, René, 120
Zrigui, Mounir, 73
Zséder, Attila, 148
Zubiaga, Arkaitz, 102
Zweigenbaum, Pierre, 70, 80, 110
Zydron, Andrzej, 1