Collecting Software Localization Data


Steven R. Loomis #0033562

Thomas Edison State College: Christine Hansen, Mentor

Liberal Arts Capstone: APR2014 LIB-495-OL11

August 30, 2014


Abstract

Software localization data is critical to modern software that is used in multiple countries and languages. The author reviewed the existing literature on software localization and surveyed people involved in collecting and maintaining locale data. The survey results were analyzed to determine the major challenges in this field.


Contents

Abstract
Contents
List of Tables and Figures
Chapter 1: Introduction
Chapter 2: Literature Review
Chapter 3: Methodology
Chapter 4: Results of the Study
Chapter 5: Summary and Discussion
References
Appendix


List of Tables and Figures

Figure 4.1: Key Words and Findings
Figure 5.1: Key Words and Findings
Figure A.1: Questionnaire distributed to respondents.


Chapter 1: Introduction

This paper discusses how software localization data can be collected and managed on a large scale. It draws on the author's first-hand knowledge from 17 years in the localization industry, including ten years spent specifically collecting and managing such data, as well as on a survey of current literature and research among those working in the field.

Background of the Topic

Software localization has an impact on practically everyone, no matter what country they live in or what language they speak. Ask many people today “What time is it?” and more and more of them will respond by pulling out their phone and looking at the time shown there. The time and date listed on their phone screen are generated according to software localization data, and follow (to a greater or lesser extent) the cultural and linguistic requirements of the user. This is software localization. Collecting and managing the data required to accomplish such tasks requires a significant amount of effort.

Problem Statement

How can this localization information be collected and managed on a large scale? There are a great many languages in use in the world today, and the level of technology use in all countries continues to rise. Furthermore, computers, and even mobile phones, are expected to accomplish increasingly complicated tasks for their owners. This raises the question of where all of the localization information comes from, who is responsible for it, and how it is managed.


The sub-questions are: What is the response of people to the task of providing information about hundreds or thousands of items in their language? What are the challenges faced in finding experts, collecting data, and analyzing results? What are ways that this situation can be improved?

Professional Significance of the Work

This field has been the major focus of my professional work for the last ten years. And yet I, for one, had not taken the time to consider the larger context in a systematic way.

Overview of Methodology

In this project I analyze the work I am familiar with in the field, including a historical survey of how this problem has been tackled. This involves literature research, but also personal contact with, and questions put to, a broad range of colleagues. To answer the questions about finding experts and results, I am making use of my contacts in the field, who come from a very broad range of institutions and organizations. This project does not include a rigorous survey or questionnaire, but instead relies on interviews with selected influential people in this field, with analysis and distillation of those conversations.

Delimitations

The scope of this project is core localization data that is used for an operating system or application environment in an office/desktop setting, but that is not specific to any particular vendor, project, or end-user application. For example, this includes how dates and times are presented, and the translation of language and country names into various languages. This project does not consider as in scope the localization of a specific application (such as Microsoft Word), that is, the translation of its menus or help messages, nor the translation of any particular book or document.

Definition of Terms

• Software: The realm of computer programs, the instructions which cause computers to perform specific tasks. Microsoft Word is an example of a software program, also called an application. Linux, Mac OS X, and Windows are examples of operating system software.

• Linguistic and cultural preferences: These are the particular spoken/written conventions used by a particular group within a particular country. For example, these preferences vary between English speakers in the United States and English speakers in the United Kingdom (spelling color versus colour, using dollars versus pounds for currency), and would further vary with French speakers in France (different words for days of the week in French versus English) and Spanish speakers in the United States (the dollar would be used for currency, but days of the week would be in Spanish).

• Software Localization: The task of adapting software to properly reflect a population group’s linguistic and cultural preferences.

• Software Localization Data: The information required to localize software, for example, knowing whether the first month of the year is spelled January or janvier (French).
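As a rough illustration of the last definition, the examples above can be pictured as a small lookup table keyed by locale. This is only a sketch: the table layout, the locale identifiers used as keys, and the `localized` helper are assumptions made for illustration, not the format of any real locale data repository.

```python
# Hypothetical, hand-written locale data for illustration only.
# Real locale data sets hold thousands of items per locale.
LOCALE_DATA = {
    "en_US": {"january": "January", "monday": "Monday", "currency": "USD"},
    "en_GB": {"january": "January", "monday": "Monday", "currency": "GBP"},
    "fr_FR": {"january": "janvier", "monday": "lundi", "currency": "EUR"},
    "es_US": {"january": "enero", "monday": "lunes", "currency": "USD"},
}


def localized(locale_id, item):
    """Return the localized value of one data item for one locale."""
    return LOCALE_DATA[locale_id][item]


if __name__ == "__main__":
    # Spanish speakers in the United States: dollars, but Spanish day names.
    print(localized("es_US", "currency"), localized("es_US", "monday"))
    print(localized("fr_FR", "january"))
```

Localized software looks such values up at run time rather than hard-coding them, which is why the data has to be collected for every supported locale.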

Summary

This project intends to thoroughly research the question of where localization data comes from, and where it is going.


Chapter 2: Literature Review

This chapter reviews the available literature on software localization data. The literature shows that software localization data is critical to the proper internationalization of software. There is little literature about the collection process itself, but I will attempt to show, with the available content, the need for and usefulness of this data.

Literature Review

Purvis et al. (2001) define two key terms: internationalization is “the process of designing a software system so it can be easily adapted to various languages and regions without major engineering changes”, while localization is “the process of adapting software for a specific region or language”. In their article, “A Practical Look at Software Internationalisation”, this New Zealand-based team discusses the purpose of software internationalization by first discussing how associations vary across cultures. They show that “[color] conventions have considerably different interpretations across cultures” and they give some sample meanings. In the US, “red” means danger. But in France it can mean aristocracy; in Egypt, death; in India, life and creativity; in Japan, anger or danger; and in China, happiness. Purvis et al. go on to describe how a learning system was internationalized using Java. First, however, they note that Java comes with locale-sensitive operations, such as formatting numbers according to local currency requirements.

In Gruman (2012), Chapter 27 on setting system preferences describes many customizations which can be made in Apple Macintosh OS X “Mountain Lion”. The purpose of these is to conform to “the peculiarities of your equipment, software, environment, and, yes, personal preferences”. Some of those preferences have to do with language and cultural conventions. Items shown include not only which language is used, but also how dates and times are shown. The default views are modified to reflect the language and country chosen, but the user can click “customize” and enter specific personal preferences.

Similarly, in Chapter 6 of a study guide that covers the X Window System, Smith (2013) describes settings equivalent to those we have seen in Mac OS X, introducing settings such as the LC_ALL environment variable to set the locale (language) used in the X Window System. Smith explains that choosing a language results in different behavior depending on that language’s requirements, which presupposes that locale data is available to fulfill those requirements.
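To make the role of LC_ALL more concrete, the sketch below shows the usual POSIX lookup order for a locale category: LC_ALL overrides the category-specific variable (for example LC_TIME), which in turn overrides LANG. The function name `resolve_locale` and the sample values are assumptions made for this illustration; the function only reads a dictionary and does not change any system setting.

```python
import os


def resolve_locale(env, category="LC_TIME"):
    """Pick the locale name for one category using the POSIX precedence:
    LC_ALL first, then the category variable, then LANG.
    Unset or empty variables are skipped; "C" is the fallback."""
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


if __name__ == "__main__":
    # Hypothetical settings: English by default, French date formats.
    sample = {"LANG": "en_US.UTF-8", "LC_TIME": "fr_FR.UTF-8"}
    print(resolve_locale(sample))       # category variable wins over LANG
    print(resolve_locale(os.environ))   # whatever this machine is set to
```

Whichever name is selected, the system can honor it only if locale data for that locale is actually installed.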

Tackling the C++11 programming language itself, Lischner (2013) introduces readers to the std::locale family of operations and to the problem of number formatting that varies between languages: one convention (such as U.S. English) would format “one thousand point one” as “1,000.1”, whereas others (German, for example) would write the same number as “1.000,1”.
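The formatting difference described here can be sketched in a few lines. The example below is not the std::locale mechanism itself; it is a simplified Python illustration in which the `SEPARATORS` table and the `format_number` helper are assumptions, and the two entries stand in for the much larger locale data a real system would consult.

```python
# Hypothetical separator data for two locales (illustration only).
SEPARATORS = {
    "en_US": {"group": ",", "decimal": "."},
    "de_DE": {"group": ".", "decimal": ","},
}


def format_number(value, locale_id, digits=1):
    """Format a number using the grouping and decimal separators of a locale."""
    seps = SEPARATORS[locale_id]
    # Start from the English-style form, e.g. "1,000.1" ...
    text = f"{value:,.{digits}f}"
    # ... then swap both separators in a single pass.
    table = str.maketrans({",": seps["group"], ".": seps["decimal"]})
    return text.translate(table)


if __name__ == "__main__":
    print(format_number(1000.1, "en_US"))  # 1,000.1
    print(format_number(1000.1, "de_DE"))  # 1.000,1
```

The code is the same for both locales; only the data differs, which is the point the literature makes implicitly.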

Summary

In conclusion, we see that throughout modern software design practices, and in modern operating systems, users are enabled to see information presented in the language(s) of their choice, using their own linguistic and cultural conventions. Though the literature does not often say so explicitly, it implies that locale data is available to back the user’s requests. That data must come from somewhere.


Chapter 3: Methodology

The overall research question is as follows: How can software localization data be collected and managed on a large scale? To answer this, I want first to establish the definition of locale data, and then to establish the need for locale data. As final background, I will explore the history of locale data. The sub-questions that follow are: What is the response of different sets of people to the task of providing information about their language? What are the challenges to securing this information?

I plan to survey users who are actually involved in this task, and find out about their experiences and results.

As for the challenges, I will conduct a thorough literature review, and also contact, interview, and hold discussions with key persons involved in locale data management in industry and in open source software organizations.

Research Design and Methodology

Next, I will locate the right people to question, interview them, and then tabulate the responses. Given the responses, I will analyze them and write the remainder of the paper.

The first sub-question is: What is the response of different sets of people to the task of providing information about their language? I will contact people involved in collecting this data, and ask them about the task of providing this information.

The second sub-question is: What are the challenges to securing this information? For this question, similarly, I will contact people who are involved in processing and securing the locale data.


For both of these questions I will draw on professional contacts in the relevant fields and ask targeted questions of specific persons.

Summary

For data analysis, I will first transcribe the recordings (for face-to-face or audio interviews) or, for email-based interviews, take the textual content of the email. I will produce a single document containing all of these conversations.

Secondly, I will summarize the main findings for each sub-question, along with specific quotes from respondents. My analysis will consist of distilling the main points for each sub-question into a coherent statement which answers it in terms of the findings.


Chapter 4: Results of the Study

As stated in Chapter 1, this study discusses the question of how software localization data can be collected and managed on a large scale. The first question asked what the overall challenges are in collecting and maintaining software localization data. The second asked what types of responses have been seen when non-technical persons are asked to contribute software localization data.

Question 1: Overall Challenges

The questionnaire (see Appendix) and a personal interview have provided the following data points.

One respondent pointed out that, previously, a single computer could not work with more than one language’s writing system at a time, so an engineer could not work on the localization of more than one language or writing system at a time. At that time (prior to about 1988) it was not seen as a general requirement for computers to handle more than one language, although this is taken as a given today.

Another respondent mentioned “Managing the size and quality of the data” as a challenge.

One respondent mentioned two major challenges. The first is that locale data can be consumed by software in very complex ways, so it is difficult to know, when collecting the data, exactly how it will be used. The second is consistency within a language, within a family of languages, and across related data items: “For example, semantics of long/short/narrow forms across locale items (such as month, weekday), or across locales”. The respondent also noted that the required data may change, so “it's not quite easy to collect the localization data in timely manner”. Often, data collection is done relative to English; however, “some locales may have specific term for two days before/after from today, while such might not be used commonly in English.” Consistency over time versus personal preference was also mentioned, in that “there might not be a single definitive localized form for an locale data item. One person might prefer one over another, while another person may prefer opposite.”
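One part of the consistency concern, namely whether every locale supplies the same set of long, short, and narrow forms, lends itself to an automated check. The sketch below is an assumption-laden illustration rather than a tool used by any respondent: the data layout, the `find_width_gaps` name, and the deliberately incomplete French sample are all hypothetical. It can only detect missing or incomplete forms; whether the forms mean the same thing across locales still needs human review.

```python
WIDTHS = ("long", "short", "narrow")

# Hypothetical sample data; the French entry is deliberately incomplete.
SAMPLE = {
    "en": {
        "long": ("January February March April May June July "
                 "August September October November December").split(),
        "short": "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(),
        "narrow": list("JFMAMJJASOND"),
    },
    "fr": {
        "long": ("janvier février mars avril mai juin juillet "
                 "août septembre octobre novembre décembre").split(),
    },
}


def find_width_gaps(locale_data, widths=WIDTHS, expected_count=12):
    """List (locale, width, problem) for month-name forms that are
    missing or do not contain the expected number of items."""
    problems = []
    for locale_id, months in sorted(locale_data.items()):
        for width in widths:
            names = months.get(width)
            if names is None:
                problems.append((locale_id, width, "missing"))
            elif len(names) != expected_count:
                problems.append(
                    (locale_id, width, f"{len(names)} of {expected_count} items")
                )
    return problems


if __name__ == "__main__":
    for problem in find_width_gaps(SAMPLE):
        print(problem)
```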

Another respondent noted the difficulty that some languages do not have well-established sets of guidelines to work from. In those cases “surveying the common best practice is not very easy and personal style and perspective of the experts commenting on the issues sometimes causes conflicts which is not easy to resolve”. The second issue this respondent raised was the scope of locale data, illustrated with the example of shoe sizes and their international correspondence: “The paradox here is that if a specific, stand-alone library is to be created for such data, the adaptability and popularity rate for such a limited and specific library would not be high. The ROI of using an external library just for such limited case is very low.” This respondent also mentioned “the danger of extending what's common and needed in some locale to all locales. A good example is all the issues raised during translating some Imperial units (e.g. "cubic foot", "pint", etc.) … Some of these units are never heard of and never used in some locales. Asking the experts on the ground to provide data for these would end up with them inventing things which never existed. The counter-argument to this might be that in a global economy of software a common set of these units are needed so by introducing them in a Metric locale like Persian, there would be an opportunity to cause some conversation about them among language experts in those locales.” This respondent’s final point, which serves as a segue into question two, was “the burden of collecting valid locale-specific data in context.”

Question 2: Responses from Non-Technical Persons

The questionnaire (see Appendix) and a personal interview have provided the following data points for question two.

On the topic of responses from non-technical persons, one respondent noted that “If the issues are at all tricky, the quality of the data becomes suspect.”

Another respondent noted two issues with collecting data from non-technical persons, which arise because “localization data contributors are not often software experts. Many of them are translators and they tend to provide translations for every single English word, which is sometimes not great as locale data. When a translator does not fully understand what the item is for, we want she/he to interact with us to clarify the context, target use, etc.”

One other respondent described the range in quality of responses from non-technical users: at best, they referred to “some best common practices (style guides, documents from national standard bodies and authorities on the language) which helped making an informed decision about the item at hand”, but at worst the information was “some invented data based on personal taste and style”. This respondent also noted that the “jargon” associated with locale data causes confusion.

Summary

Four respondents had responded as of this writing: three through the form and one in a personal interview, which was recorded. Figure 4.1 summarizes the key words and concepts from each question.


Figure 4.1: Key Words and Findings

• Question One (Challenges)
  • data: size, quality, complexity, consistency, scope
  • language: consistency
• Question Two (Responses)
  • data: quality, complexity


Chapter 5: Summary and Discussion

This final chapter summarizes and discusses the information gathered about how software localization data can be collected and managed on a large scale. As Purvis et al. (2001) define the two terms, internationalization is “the process of designing a software system so it can be easily adapted to various languages and regions without major engineering changes”, while localization is “the process of adapting software for a specific region or language”. Localization data, therefore, is the basic information needed to adapt software.

Statement of Problem

As explained in Chapter 1, the question arises as to how this localization information can be collected and managed on a large scale. There are a great many languages in use in the world today, and the level of technology use in all countries continues to rise. Furthermore, computers, and even mobile phones, are expected to accomplish increasingly complicated tasks for their owners. This raises the question of where all of the localization information comes from, who is responsible for it, and how it is managed.

The sub-questions are: What is the response of people to the task of providing information about hundreds or thousands of items in their language? What are the challenges faced in finding experts, collecting data, and analyzing results? What are ways that this situation can be improved?

Review of Methodology

As noted in the previous chapter, respondents were emailed a link to an online form with a brief cover letter. The form contained the two main questions, as well as a space for a contact email address for follow-up. Information was collected and stored using a Google Form. Some respondents were asked the same questions in person, and their answers were recorded.

Summary of Results

Figure 5.1 restates the summary given in Figure 4.1. Complexity and consistency, taken as a group, were the most frequently mentioned topics.

Relationship of Research to the Field

The research validated, and the literature review indirectly validated, the need for and existence of this localization data in modern software systems. For example, Lischner (2013) assumes that a C++11 program can call on the “std::locale” operations to format numbers according to different language and country requirements, and makes the case that different languages and countries have differing cultural and linguistic requirements. This presumes that the information that varies by language and country is somehow available to software.

It was difficult to find reference material that addresses these specialized topics directly. There may be a need for further publications on this topic.

Figure 5.1: Key Words and Findings

• Question One (Challenges)
  • data: size, quality, complexity, consistency, scope
  • language: consistency
• Question Two (Responses)
  • data: quality, complexity


Discussion of Results


The literature review shows the recognized importance of locale data, even though the lack of specific literature is an argument from silence that this data is in general taken for granted. By surveying some of those involved in the process of locale data collection and management, I found that there are plenty of concerns about the entire process and the challenges faced. Complexity was cited several times as a concern: for those developing the data, for those providing input to it, and for those using it. With this complexity comes a concern about the quality of the data. As some respondents noted, if complex information is not well understood, it will not be well translated. This is related to the question of scope, that is, what level of detail should be included in commonly used localization systems. If a system is very specialized, it may not get widespread use, but a non-specialized system may not cover all the data items that users require.

Another concern cited is the size of the data, along with consistency within and between languages. The most recent Ethnologue survey by Lewis et al. (2014) notes over 7,000 living languages. The actual number which most computer systems support in written form is much smaller, but this figure does give the potential scope of the problem of specialization.


Conclusions

As a result of this research, the case is clearly made that the task of collecting and managing software localization data faces many challenges, but that it has, and will continue to have, a vital role in today’s information economy.


References

Gruman, Galen. (2012). OS X Mountain Lion bible [Books24x7 version]. Available from http://common.books24x7.com/toc.aspx?bookid=49697.MM (accessed August 14, 2014)

Lewis, M. Paul, Simons, Gary F., & Fennig, Charles D. (Eds.). (2014). Ethnologue: Languages of the world (17th ed.). Dallas, Texas: SIL International. Online version: http://www.ethnologue.com (accessed August 30, 2014)

Lischner, Ray. (2013). Exploring C++ 11: Problems and solutions handbook [Books24x7 version]. Available from http://common.books24x7.com/toc.aspx?bookid=62115 (accessed August 14, 2014)

Purvis, M., Hwang, P., Purvis, M., Madhavji, N., & Cranefield, S. (2001). A practical look at software internationalisation. Journal of Integrated Design & Process Science, 5(3), 79.

Smith, Roderick W. (2013). Chapter 6: Configuring the X Window System, localization, and printing. In CompTIA Linux+ study guide: Exams LX0-101 and LX0-102 (2nd ed.). Sybex [Books24x7 version]. Available from http://common.books24x7.com/toc.aspx?bookid=51120 (accessed August 14, 2014)


Appendix

The questionnaire appeared on-screen as follows.

Figure A.1: Questionnaire distributed to respondents.
