Collecting Software Localization Data
Transcript of Collecting Software Localization Data
Collecting Software Localization Data
Steven R. Loomis #0033562!
Thomas Edison State College: Christine Hansen, Mentor!
Liberal Arts Capstone : APR2014 LIB-495-OL11!
August 30, 2014!
!
�1
Collecting Software Localization Data
Abstract!
!Software localization data is critical to modern software which is used in multiple
countries and languages. The author investigated existing literature around software
localization and conducted a survey of persons involved in collecting and maintaining
locale data. The results of the survey were analyzed to determine the major challenges
in this field.!
�2
Collecting Software Localization Data
Contents!
Abstract! 2! Contents! 3! List of Tables and Figures! 4! Chapter 1: Introduction! 5! Chapter 2: Literature Review! 8! Chapter 3: Methodology! 10! Chapter 4: Results of the Study! 12! Chapter 5: Summary and Discussion! 16! References! 20! Appendix! 21
�3
Collecting Software Localization Data
List of Tables and Figures!
Figure 4.1: Key Words and Findings! 15! Figure 5.1: Key Words and Findings! 17! Figure A.1: Questionnaire distributed to respondents.! 21
�4
Collecting Software Localization Data
Chapter 1: Introduction!
This paper discusses how software localization data can be collected and
managed on a large scale. It is based on first hand knowledge of the author through 17
years in the localization industry, and ten years specifically involved in collecting and
managing such data, surveys of current literature, and research among those working in
the field.!
Background of the Topic!
Software localization has an impact on practically every one, no matter what
country they live in or what language they speak. If you ask, “What time is it?” to many
people today, what is their response? More and more people will respond by pulling out
their phone and looking at the time given there. The time and date listed on their phone
screen is generated according to software localization, and follows (to a greater or
lesser extent) the cultural and linguistic requirements of its user. This is software
localization. Collecting and managing the data required to accomplish such tasks
requires a significant amount of effort.!
Problem Statement!
How can this localization information be collected and managed on a large
scale? There are a great many languages in use in the world today, and the level of
technology use in all countries continues to rise. Furthermore, computers, and even
mobile phones, are expected to accomplish increasingly complicated tasks for their
owners. This creates the question about where all of the localization information comes
from, who is responsible for it, and how it is managed.!
�5
Collecting Software Localization Data
The sub questions are: What is the response of people to the task of providing
information about hundreds or thousands of items in their language? What are the
challenges faced in finding experts, collecting data, and analyzing results? What are
ways that this situation can be improved?!
Professional Significance of your Work!
This field has been the major focus of my professional work for the last ten years.
And yet, I for one, have not taken the time to consider the larger context in such a
systematic way.!
Overview of Methodology!
In this project I analyze the work I am familiar with in the field, including a
historical survey of how this problem has been tackled. This involves literature research,
but also personal contacts and questions with a broad range of colleagues. For
answering questions about finding experts and results, I am making use of my contacts
in the field from a very broad range of institutions and organizations. This project does
not include a rigorous survey or questionnaire, but instead interviews with selected
influential people in this field, with analysis and distillation of those conversations.!
Delimitations!
The scope of this project is core localization data that is used for an operating
system or application environment in an office/desktop setting, but not specific to any
particular vendor, project, or specific end user application. For example, this includes
how dates and times are presented, and translation of language and country names in
various languages. This project does not consider as in-scope the localization of a
�6
Collecting Software Localization Data
specific application (such as Microsoft Word) as to the translation of its menus or help
messages, nor the translation of any particular book or document.!
Definition of Terms!
• Software: The realm of computer programs, the instructions which cause
computers to perform specific tasks. Microsoft Word is an examples of a software
program, also called an application. Linux, Mac OSX or Windows are examples of
Operating System Software. !
• Linguistic and cultural preferences: These are the particular spoken/written
conventions used by a particular group within a particular country. For example, these
preferences vary between English speakers in the United States and English
Speakers in the United Kingdom (spelling color versus colour, using Dollars versus
Pounds for currency), and would further vary with French speakers in France
(different words for days of the week in French vs English), and Spanish speakers in
the United States (Dollar would be used for currency, but days of the week in
Spanish). !
• Software Localization: The task of adapting software to properly reflect a
population group’s linguistic and cultural preferences.!
• Software Localization Data: the information required to localize software, for
example, knowing whether the first month of the year is spelled January or janvier
(French).!
Summary!
This project intends to thoroughly research the question of where localization
data comes from, and where it is going.!
�7
Collecting Software Localization Data
Chapter 2: Literature Review!
This chapter will review the available literature around software localization data.
The literature will show that software localization data is critical to proper
internationalization of software. There is not a lot of literature about the collection
process itself, but I will attempt to show with the available content the need and
usefulness of this data.!
Literature Review!
As Purvis, et al. (2001) define these two terms, Internationalization is “the
process of designing a software system so it can be easily adapted to various
languages and regions without major engineering changes”, while Localization is “the
process of adapting software for a specific region or language”. In their article, “A
Practical Look at Software Internationalisation”, this New Zealand-based team
discusses the purpose of software internationalization by first discussing how
associations vary across cultures. They show that “[color] conventions have
considerably different interpretations across cultures” and they give some sample
meanings. In the US, “red” means danger. But in France it can mean aristocracy, in
Egypt, death, in India life and creativity, in Japan, anger or danger, and in China,
happiness. Purvis, et al. go on to describe how a learning system was internationalized
using Java processes. But, first, they describe that Java comes with locale sensitive
operations such as formatting numbers according to local currency requirements.!
In Gruman (2012), Chapter 27 on setting system preferences describes many
customizations which can be made with Apple Macintosh OS X “Mountain Lion”. The
purpose of these is to conform to “the peculiarities of your equipment, software,
�8
Collecting Software Localization Data
environment, and, yes, personal preferences”. Some of those preference have to do
with your language and cultural preferences. Items shown include not only which
language is used, but how dates and times are shown. The default views are modified
to reflect the language and country chosen, but the user can click “customize” and enter
in specific personal preferences.!
Similarly, in Chapter 6 of the study guide to using the X Window System, Smith
(2013) describes equivalent settings to those we have seen in Mac OSX by introducing
settings such as LC_ALL to set the locale (language) used in the X Window System. It
is explained that choosing a language results in different behavior depending on that
language’s requirements, which presupposes that there is locale data available to be
able to fulfill those requirements.!
Tackling the C++11 programming language itself, Lischner (2013) introduces
readers to the std::locale family of operations, introducing the problem of variant number
formatting in different languages, such as where one convention (such as U. S. English)
would format “one thousand point one” as “1,000.1” whereas others (German, for
example) would write the sane number as “1.000,1”.!
Summary!
In conclusion, we see that throughout modern software design practices, and in
modern operating systems, users are enabled to see information presented in the
language(s) of their choice, using their own linguistic and cultural conventions. Though
not often referenced explicitly in the literature, it is implied that there is locale data
available to back the user’s requests. That data must come from somewhere.!
�9
Collecting Software Localization Data
Chapter 3: Methodology!
The overall research question is as follows: How can software localization data
be collected and managed on a large scale? To answer this, I want to first establish the
definition of locale data, then establish the need for locale data. As final background, I
will explore the history of locale data. The next sub research question is, What is the
response of different sets of people to the task of providing information about their
language? and What are the challenges to securing this information?!
I plan to make a survey of users who are actually involved in this task, and find
out about their experiences and results.!
As far as the challenges, I will make a thorough literature review, and also
contact, interview, and discuss with key persons involved in locale data management in
industry and in open source software organizations.!
Research Design and Methodology!
Next, I will locate the right people to ask the question of, start to interview them
and ask questions, and then tabulate the response. Given the response, I will analyze it
and write the remainder of the paper.!
The first sub question is: What is the response of different sets of people to the
task of providing information about their language? I will contact people involved in
collecting this data, and ask them about their tasks of providing this information. This is
the first sub question.!
The second question is: What are the challenges to securing this information?
For this question, similarly, I will contact people who are involved in processing and
securing the locale data.!
�10
Collecting Software Localization Data
For both of these questions I will exploit professional contacts in the relevant
fields and ask targeted questions of specific persons.!
Summary!
For data analysis, I will first take transcripts of recording (for face to face or audio
interviews), or if they are email based I will take the textual content of that email. I will
produce a single document containing all of these conversations.!
Secondly, I will try to summarize the main findings for each sub question, as well
as specific quotes from people. My analysis will consist of attempting to distill the main
points from each sub point into a coherent statement which answers those sub points in
terms of the findings.!
�11
Collecting Software Localization Data
Chapter 4: Results of the Study!
As stated in Chapter 1, this study discusses the question of how software localization
data can be collected and managed on a large scale. The first question asked was to
find out what the overall challenges found in collecting and maintaining software
localization data are. Secondly, it was asked what types of responses have been seen
when asking non-technical persons to contribute software localization data.!
Question 1: Overall Challenges!
The questionnaire (see Appendix) and a personal interview have provided the
following data points.!
One respondent pointed out that previously, a single computer could not work
with more than one language’s writing system at a time. So, an engineer could not work
on the localization of more than one language or writing system at a time. At that time
(prior to about 1988) it was not seen as a general requirement for computers to handle
more than one language, although this is taken as a given today.!
Another respondent mentioned “Managing the size and quality of the data” as a
challenge.!
One respondent mentioned two major challenges. The first is that locale data can
be consumed by software in a very complex way, and so it is difficult when collecting the
data to know how exactly it will be used. The second major challenge mentioned by this
respondent is consistency within a language, within a family of languages, across
related data items— “For example, semantics of long/short/narrow forms across locale
items (such as month, weekday), or across locales". Also mentioned included the fact
that required data may change and so “it's not quite easy to collect the localization data
�12
Collecting Software Localization Data
in timely manner”. Often, data collection is done relative to English, however “some
locales may have specific term for two days before/after from today, while such might
not be used commonly in English.” Consistency over time versus personal preference
was also mentioned, in that “there might not be a single definitive localized form for an
locale data item. One person might prefer one over another, while another person may
prefer opposite.” !
Another respondent noted the difficulty that some languages do not have well
established sets of guidelines to work from. In those cases “surveying the common best
practice is not very easy and personal style and perspective of the experts commenting
on the issues sometimes causes conflicts which is not easy to resolve”. The second
issue mentioned by this respondent was the scope of locale data, and gave an example
of shoe sizes (the international correspondence of them). ”The paradox here is that if a
specific, stand-alone library is to be created for such data, the adaptability and
popularity rate for such a limited and specific library would not be high. The ROI of using
an external library just for such limited case is very low.” This respondent also
mentioned “the danger of extending what's common and needed in some locale to all
locales. A good example is all the issues raised during translating some Imperial units
(e.g. "cubic foot", "pint", etc.) … Some of these units are never heard of and never used
in some locales. Asking the experts on the ground to provide data for these would end
up with them inventing things which never existed. The counter-argument to this might
be that in a global economy of software a common set of these units are needed so by
introducing them in a Metric locale like Persian, there would be an opportunity to cause
some conversation about them among language experts in those locales.” This
�13
Collecting Software Localization Data
respondents final point was given as a segue into question two, and that is “the burden
of collecting valid locale-specific data in context.”!
Question 2: Responses from Non-Technical Persons!
The questionnaire (see Appendix A) and a personal interview have provided the
following data points for question two.!
On the topic of responses from non-technical persons, one respondent noted that
“If the issues are at all tricky, the quality of the data becomes suspect.”!
Another respondent noted two issues with collecting data from non technical
persons because “localization data contributors are not often software experts. Many of
them are translators and they tend to provide translations for every single English word,
which is sometimes not great as locale data. When a translator does not fully
understand what the item is for, we want she/he to interact with us to clarify the context,
target use, etc.”!
One other respondent described the range in quality of responses by non-
technical uses as at best referring to “some best common practices (style guides,
documents from national standard bodies and authorities on the language) which
helped making an informed decision about the item at hand” but at worst the information
was “some invented data based on personal taste and style”. This respondent also
noted the “jargon” associated with locale data as causing confusion.!
Summary!
Four respondents have responded as of this writing, three on the form and one
personal interview which was prerecorded. Figure 4.1 summarizes the key words and
concepts from each question.!
�14
Collecting Software Localization Data
Figure 4.1: Key Words and Findings!
�15
Figure 4.1: Key Words and Findings!!• Question One (Challenges)!
• data: size, quality, complexity, consistency, scope!• language: consistency!
• Question Two (Responses)!• data: quality, complexity
Collecting Software Localization Data
Chapter 5: Summary and Discussion!
This final chapter will summarize and discuss information gathered about how
software localization data can be collected and managed on a large scale. As Purvis, et
al. (2001) define the two terms, Internationalization is “the process of designing a
software system so it can be easily adapted to various languages and regions without
major engineering changes”, while Localization is “the process of adapting software for
a specific region or language”. Therefore, localization data is the basic information
needed to adapt software.!
Statement of Problem!
As explained in Chapter 1, the question arises as to how this localization
information can be collected and managed on a large scale? There are a great many
languages in use in the world today, and the level of technology use in all countries
continues to rise. Furthermore, computers, and even mobile phones, are expected to
accomplish increasingly complicated tasks for their owners. This creates the question
about where all of the localization information comes from, who is responsible for it, and
how it is managed.!
The sub questions are: What is the response of people to the task of providing
information about hundreds or thousands of items in their language? What are the
challenges faced in finding experts, collecting data, and analyzing results? What are
ways that this situation can be improved?!
Review of Methodology!
As noted in the previous chapter, respondents were emailed a link to an online
form with a brief cover letter. The form had the two main questions in it, as well as a
�16
Collecting Software Localization Data
space for further contact email. Information was collected and stored on a Google Form.
Some respondents were asked the same questions in person and these answers were
recorded.!
Summary of Results!
Figure 5.1 restates the summary given in Figure 4.1.!
Figure 5.1: Key Words and Findings!Complexity/consistency as a group were the most mentioned topics.!
Relationship of Research to the Field!
The research validated, and the literature review indirectly validated the need
and existence of this localization data in modern software systems. For example,
Lischner (2013) assumes that C++11 is able to request the “std::locale” operations to
format numbers according to different language and country requirements, and make
the case that different languages and countries have differing cultural and linguistic
requirements. This presumes that such information as varies by language and country
is somehow available to software.!
It was difficult to find reference material referring directly to these specialized
topics. There may be a need for further publications on this topic.!
�17
Figure 5.1: Key Words and Findings!!• Question One (Challenges)!
• data: size, quality, complexity, consistency, scope!• language: consistency!
• Question Two (Responses)!• data: quality, complexity
Collecting Software Localization Data
Discussion of Results!
What was the significance of your findings? Explain how your work adds to the
body of knowledge in your field.!
My work shows in the literature review the recognized importance of locale data,
even though the lack of specific literature is an argument from silence that this data is in
general taken for granted. By surveying some of those involved in the process of locale
data collection and management, it was found that there are plenty of concerns about
the entire process and the challenges faced. Complexity was cited several times as a
concern, both for those developing data, but also those providing input to it, and those
using this data. With this complexity comes a concern about the quality of the data. As
some respondents noted, if complex information is not well understood, it will not be
well translated. This is related to the question about scope, which is a question of what
level of detail should be included in commonly used localization systems. If a system is
very specialized, it may not get widespread use, but a non-specialized system may not
cover all user required data items.!
Another concern is the size of the data, which is a concern also cited, as well as
consistency within and between languages. The most recent Ethnologue survey by
Lewis, et al. (2014) notes over 7,000 living languages. The actual number which most
computer systems support in written form is much smaller, but this figure does give the
potential scope of the problem of specialization.!
�18
Collecting Software Localization Data
Conclusions!
As a result of this research, the case is clearly made that the task of collecting
and managing software localization data does face many challenges, but that it is and
will continue to have a vital role in today’s information economy.!
�19
Collecting Software Localization Data
References!
Gruman, Galen. ( © 2012). Os x mountain lion bible. [Books24x7 version] Available from
http://common.books24x7.com/toc.aspx?bookid=49697.MM (Accessed August 14,
2014)!
Purvis, M., Hwang, P., Purvis, M., Madhavji, N., & Cranefield, S. (2001). A practical look
at software internationalisation. Journal Of Integrated Design & Process Science, 5(3),
79.!
Smith, Roderick W.. "Chapter 6 - Configuring the X Window System, Localization, and
Printing". CompTIA Linux+ Study Guide: Exams LX0-101 and LX0-102, 2nd Edition.
Sybex. © 2013. Books24x7. <http://common.books24x7.com/toc.aspx?bookid=51120>
(accessed August 14, 2014)!
Lewis, M. Paul, Gary F. Simons, and Charles D. Fennig (eds.). 2014. Ethnologue:
languages of the world, seventeenth edition. Dallas, Texas: SIL International. Online
version: http://www.ethnologue.com. (Accessed August 30, 2014)!
Lischner, Ray. ( © 2013). Exploring c++ 11: problems and solutions handbook.
[Books24x7 version] Available from http://common.books24x7.com/toc.aspx?
bookid=62115. (accessed August 14, 2014)!
�20