1541-1672/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society.
IEEE Intelligent Systems, January/February 2011, www.computer.org/intelligent

Society Online, Part 2

Introducing New Features to Wikipedia: Case Studies for Web Science

Mathias Schindler, Wikimedia Deutschland

Denny Vrandečić, Karlsruhe Institute of Technology

Introducing new features to Wikipedia is a complex sociotechnical process. The authors compare the Web science process to the previous introduction of new features and suggest how to use it as a model for the future development of Wikipedia.

Wikipedia is a free Web-based encyclopedia produced by hundreds of thousands of online contributors [1]. Today, Wikipedia offers more than 10 million articles and is available in more than 250 languages. For some of these languages, Wikipedia is not just free, but the only encyclopedia available. Currently, it is among the 10 most-visited websites in the world, with more than 50,000 hits per second during peak times.

Introducing new technical features to Wikipedia is thus a complex sociotechnical process. Wikipedia runs on the open source MediaWiki wiki engine. Because Wikipedia’s technical implementation and content development are both done in free, open projects that include complete change logs, we can trace and analyze the codevelopment of the technical and social aspects. (MediaWiki’s source code repository is accessible at http://svn.wikimedia.org, and the database dumps of Wikipedia’s complete log history are available at http://download.wikipedia.org.)

Because of Wikipedia’s open nature and abundant and freely available project and history data, researchers often use it to study collaborative content creation. Both available research and the examples of previous feature implementations suggest that the better we understand the complex interaction between technical progress and social adaption, the more likely the introduction of a new technology to Wikipedia will be successful.

Based on previous implementations of new features (including the category system in 2004, parser functions in 2006, and flagged revisions in 2008), this article discusses the impact of introducing a further new feature: creating semantic annotations. We show how the interaction between the technical features and the community is an instantiation of the Web science process, a model that describes the evolution of the Web and its systems.

Previous Wikipedia Upgrades

According to the Web science research process, to resolve issues, engineers come up with new ideas and then implement them. Because the Web is an inherently social infrastructure, an idea’s technical implementation will necessarily involve social aspects. If done well, the idea’s sociotechnical implementation can resolve the original issue on a micro level. However, because of the Web’s size and complexity, such solutions often have unexpected implications on the macro level. Engineers must in turn analyze these implications to identify new issues (such as situations that do not conform to certain values), thus starting the process anew. (A detailed description of the steps in the process was presented in an earlier work [2].) Tim Berners-Lee has called the complexity step from the micro solution to the macro solution one of the magics of Web science, and he called the creativity to get from issues to ideas the second magic of Web science [3].

To illustrate this process, we describe examples from the history of Wikipedia, tracing and analyzing them from the available data about the Wikipedia project and its history.

Category System

Following Wikipedia’s initial growth, it became necessary to introduce features to help users explore and navigate the site’s content. Early versions of the MediaWiki software offered only the manually created links, a backlinks feature (which displays all pages that link to a specific page), and a full-text search to help users discover content in the encyclopedia. In many other wikis, the backlinks feature was used to implement a rudimentary tagging system: once a link was created to a page such as “Greece-related topic” on all relevant pages, the list of backlinks on the page “Greece-related topic” was a list of all pages on that topic. This approach has several disadvantages:

• the links to the topic page are displayed in the text like any other link,

• the linked page itself behaves like a normal article and displaying the backlinks requires a second click on the appropriate link in the toolbar, and

• normal articles (such as the one for Greece) cannot be used as category pages because all normal links then also appear in the list, not just those that were introduced for categorizing. (That is, Albania would be in the list of pages linking to Greece because it’s a bordering country.)

Wikipedia often avoided such wiki-specific idiosyncrasies. For example, Wikipedia omitted camel-case syntax in favor of free links, so it is often described as an atypical wiki.

The category system was introduced to MediaWiki in 2004 to solve this problem. Each article could be put into an arbitrary number of categories that are identified by freely chosen names. Adding a page to the category “Greece” is done by adding [[Category:Greece]] to the wiki text. Category links themselves are not displayed in the article text, but in separate visual elements on page rendering. The category page is a distinct page in the newly introduced category namespace and thus separate from the article space. Category pages list all pages belonging to a specific category. The new category system resolved all the identified technical issues of using backlinks for categorizing articles.
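The difference between backlink-based tagging and dedicated category links can be illustrated with a minimal Python sketch. It is not MediaWiki code; the function name index_pages, the regular expression, and the two sample pages are hypothetical and chosen only to mirror the Greece and Albania example above.

```python
import re
from collections import defaultdict

# Matches [[Target]] and [[Target|label]]; category links look like [[Category:Name]].
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")

def index_pages(pages):
    """Build (backlinks, categories) from a {title: wiki_text} mapping."""
    backlinks = defaultdict(set)   # link target -> pages that link to it
    categories = defaultdict(set)  # category name -> pages placed in it
    for title, text in pages.items():
        for target in LINK.findall(text):
            target = target.strip()
            if target.startswith("Category:"):
                categories[target[len("Category:"):]].add(title)
            else:
                backlinks[target].add(title)
    return backlinks, categories

# Hypothetical sample pages, not taken from Wikipedia.
pages = {
    "Athens": "Athens is the capital of [[Greece]]. [[Category:Greece]]",
    "Albania": "Albania borders [[Greece]]. [[Category:Balkans]]",
}
backlinks, categories = index_pages(pages)
print(sorted(backlinks["Greece"]))   # every page that merely mentions Greece
print(sorted(categories["Greece"]))  # only pages deliberately categorized
```

In this sketch the backlink list for Greece contains Albania as well as Athens, whereas the category index contains only the page that an editor explicitly categorized.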

The category system was activated in all language editions at the same time without consulting the respective Wikipedia communities. For most of the languages, it was quickly applied to categorize a majority of the existing pages. Soon new issues were discovered, however, such as how to organize categories themselves. This led to further new features such as category tree extensions, which render a tree of subcategories on a category page. The introduction and application of the categories were heatedly debated in some language communities, a few of which enforced restrictions on the use of categories with manual effort. The German Wikipedia community agreed on a moratorium on category use to first define some guidelines for their application. The community members manually enforced the guidelines and the moratorium.

At the time, the power to develop, enable, and disable new features lay basically with the software developers. The communities around the different language Wikipedias had no option but to accept these changes and deal with the aftermath. Technical details, such as the fact that redirects on categories did not mean that a contributor could use the redirecting category as a synonym for categorization, had enormous social impact because they formed the patterns of interaction with the software. (A more thorough analysis of the category system is available elsewhere [4].)

Parser Functions

Developers use templates to include the text of another page (usually in the separate template namespace) where the template is being called. This allows for higher consistency within the wiki, allowing some elements to be written only once and reused in several articles. Templates can also feature parameters; that is, parametrized template calls will insert the parameter values in the replacing text (see Figure 1). Such templates are widely used in Wikipedia; for example, the English Wikipedia offers more than 200,000 templates.
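The parameter substitution described above can be sketched in a few lines of Python. This is an illustrative toy, not MediaWiki’s parser: the expand function and both regular expressions are assumptions, the sketch handles only named parameters and non-nested calls, and the page and template text are copied from Figure 1.

```python
import re

CALL = re.compile(r"\{\{([^{}]+)\}\}")       # {{Name | key=value | ...}}
PARAM = re.compile(r"\{\{\{([^{}]+)\}\}\}")  # {{{key}}} inside a template body

def expand(text, templates):
    """Replace each template call in text with the template body, parameters filled in."""
    def replace_call(match):
        name, *args = [part.strip() for part in match.group(1).split("|")]
        params = {}
        for arg in args:
            if "=" in arg:  # this sketch supports named parameters only
                key, value = arg.split("=", 1)
                params[key.strip()] = value.strip()
        body = templates[name]
        # Unknown parameters are left untouched in the output.
        return PARAM.sub(lambda m: params.get(m.group(1).strip(), m.group(0)), body)
    return CALL.sub(replace_call, text)

# Template source and page source as shown in Figure 1.
templates = {
    "Country": "a country in {{{Continent}}}. Its capital is {{{Capital}}}.\n[[Category:Country]]"
}
page = "Greece is {{Country | Continent=Europe | Capital=Athens}}"
print(expand(page, templates))
```

Under these assumptions, the printed text corresponds to Figure 1c: the call is replaced by the template body, and the category link travels along with it.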

In 2006, some Wikipedians discovered that through an intricate and complicated interplay of templating features and Cascading Style Sheets (CSS), they could create conditional wiki text, that is, text that was displayed if a template parameter had a specific value. This included repeated calls for templates within templates, which bogged down the whole system’s performance. As a result, the developers had to either disallow the spread of an obviously desired feature by detecting such usage and explicitly disallowing it within the software, or offer an efficient alternative. Tim Starling offered such an alternative when he introduced parser functions (wiki text that calls functions implemented in the underlying software) in April 2006:

“In response to a campaign by users of the English Wikipedia to harass developers by introducing increasingly ugly and inefficient meta-templates to popular pages, I’ve caved in and written a few reasonably efficient parser functions. There are two conditional functions and a mathematical expression function.” [5]

At first, only conditional text and the computation of simple mathematical expressions were implemented, but even this increased the possibilities for wiki editors enormously. In time, further parser functions were introduced, finally leading to a framework that allowed the simple writing of extension functions to add arbitrary functionalities, such as geocoding services or widgets. The developers were clearly reacting to the community’s demands, being forced either to fight the solution to the community’s issue or offer an improved technical implementation to replace the previous practice and achieve overall better performance.
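To make the idea of conditional text and expression evaluation concrete, the following Python sketch evaluates two function calls written in wiki text. The names #if and #expr follow the syntax of MediaWiki’s ParserFunctions extension as we understand it; the evaluator itself (the functions evaluate, call, and render) is a hypothetical toy and does not reflect how MediaWiki implements these functions.

```python
import ast
import operator
import re

# Arithmetic operators allowed inside #expr in this toy evaluator.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(node):
    """Evaluate a parsed expression limited to numbers and + - * /."""
    if isinstance(node, ast.Expression):
        return evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -evaluate(node.operand)
    raise ValueError("unsupported expression")

# An innermost call such as {{#if: test | then | else}} or {{#expr: 1 + 2}}.
FUNCTION = re.compile(r"\{\{#(\w+):([^{}]*)\}\}")

def call(match):
    name = match.group(1)
    args = [arg.strip() for arg in match.group(2).split("|")]
    if name == "if":
        then = args[1] if len(args) > 1 else ""
        otherwise = args[2] if len(args) > 2 else ""
        return then if args[0] else otherwise  # non-empty test selects "then"
    if name == "expr":
        return str(evaluate(ast.parse(args[0], mode="eval")))
    return match.group(0)  # unknown functions are left unchanged

def render(text):
    """Resolve calls inside-out by repeating until the text stops changing."""
    previous = None
    while previous != text:
        previous, text = text, FUNCTION.sub(call, text)
    return text

print(render("{{#if: Athens | Its capital is Athens. | No capital is known.}}"))
print(render("Seven times six is {{#expr: 7 * 6}}."))
```

The point of the sketch is that each call is resolved by ordinary code in the underlying software rather than by chains of nested meta-templates, which is the performance argument made above.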

Flagged Revisions

Because anyone can edit the pages in a wiki-based encyclopedia, article defacing can and does occur. The first approach to dealing with this problem was to protect articles that tended to be defaced often (such as the main index page and pages for George W. Bush and abortion), only letting users with administrator rights edit them. Naturally, this approach decreases “wikiness.” As a result, Wikipedia introduced semi-protection, making articles editable only by registered users. Preliminary research indicates that semi-protection greatly reduces risk of vandalism but does not severely affect the collaborative process.

Flagged revisions enable a second layer to store the ad hoc validation of pages. After a trial phase, this became a permanent feature in the German-language Wikipedia, and in February 2009, all articles in this edition were assigned at least one flagged revision. The remaining work is to keep up with edits by unregistered or new users. Any Wikipedia author who is a member of the editor group can mark a revised article as compliant with a set of conditions on content quality. The current condition for such a flag is the “lack of obvious forms of vandalism,” a deliberately low threshold. Unregistered Wikipedia visitors (the vast majority of Wikipedia visitors) are shown the latest flagged revision instead of the most recent article version.
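The display rule described above can be sketched as a small selection function. This is a minimal illustration, not the extension’s code: the Revision class and revision_to_show are hypothetical names, and two behaviors are assumptions on our part, namely that registered users see the most recent revision and that a page without any flagged revision falls back to its newest one.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Revision:
    rev_id: int
    text: str
    flagged: bool = False  # marked by a member of the editor group

def revision_to_show(history: List[Revision], registered: bool) -> Optional[Revision]:
    """Pick the revision a visitor sees; history is ordered oldest to newest."""
    if not history:
        return None
    if registered:
        return history[-1]  # assumption: registered users see the newest revision
    for revision in reversed(history):
        if revision.flagged:
            return revision  # latest flagged revision for unregistered visitors
    return history[-1]  # assumption: fall back if nothing has been flagged yet

# Hypothetical history: a checked revision followed by an unchecked edit.
history = [
    Revision(1, "Greece is a country in Europe.", flagged=True),
    Revision(2, "Greece is a country in Europe. (unreviewed edit)"),
]
print(revision_to_show(history, registered=False).rev_id)
print(revision_to_show(history, registered=True).rev_id)
```

In this example the unregistered visitor is served revision 1 until someone in the editor group flags revision 2.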

Both prior to implementation and after the initial start of this feature, a lengthy debate was held within the author community, resulting in a straw poll. The preliminary result was a rather strong mandate to keep the feature. Other language projects were given the choice to use this extension (newspapers are following the discussion on the English Wikipedia). Developers and the community worked closely to realize this solution. This feature is expected to make a strong impression on casual Wikipedia readers by ensuring that no spam or profanity is displayed to them. Based on the exhaustive debate, the feature should effectively confine vandalism, and as of now, the feature seems to be functioning as intended.

Further Examples

Select other features can also provide worthwhile examples for further investigation. For example, some early extensions might be surprising in their scope. In one example, the Hiero wiki extension let contributors use hieroglyphs within wiki text. This extension was created and advocated by a small user group, but it was swiftly implemented and integrated because of the early state of the overall project. This functionality was later refactored out of the core without affecting the usage in the Wikipedia content.

Greece is {{Country | Continent=Europe | Capital=Athens}}

(a)

a country in {{{Continent}}}. Its capital is {{{Capital}}}.

[[Category:Country]]

(b)

Greece is a country in Europe. Its capital is Athens.

[[Category:Country]]

(c)

Figure 1. Template use in Wikipedia. (a) The source of a page about Greece calling the country template, (b) the country template source, and (c) the text of the page about Greece after template expansion.

The cite extension, created by core developer Magnus Manske, lets contributors add citation information to an article and render the citations appropriately. The cite extension introduces new, XML-inspired syntax to the wiki text, which is inconsistent with the rest of the wiki syntax. This inconsistent syntax did not bother the community, who quickly adopted and extended the possibilities of the new extension.

The MediaWiki software is further extending the possibility of personalized views for logged-in users. Within the wiki, users can add new JavaScript and CSS commands that will be served with the HTML pages. Some users offer JavaScript snippets that can be integrated into their own JavaScript extension page. This framework allows for almost arbitrary adaptations of the individual Wikipedia experience, for example, by adding a more powerful editor or hiding parts of the page output. The widget extension takes this a step further, letting less technically inclined users access some of these personalization features as well. These extensions decouple the users’ individual experiences from the community and let them individually enhance Wikipedia with new features.

A recurring topic among developers and active users has been the conversion of wiki text or rendered HTML output to PDF. Early attempts were mostly based on manual work (such as the WikiReader project), involving OpenOffice templates and their PDF output. These projects triggered several pieces of software that implemented a MediaWiki-to-PDF converter or a different way to create more visually appealing Wikipedia articles for print. In 2008, the Wikimedia Foundation announced a partnership with PediaPress, a commercial partner that would create and maintain the software to create the output. This demonstrates that modules developed outside of the MediaWiki open source developer community can be integrated into Wikipedia.

Semantic Annotations

The Semantic MediaWiki project lets contributors add structured data to wiki pages and exploit these both within the wiki through querying and browsing and outside of the wiki using the Resource Description Framework (RDF) export of the annotation structure [6]. To create the structured data, a MediaWiki syntax extension is employed to allow for typed links and further annotations. On the page for Greece, for example, the wiki text [[Capital::Athens]] would be interpreted as the triple Greece-Capital-Athens.
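How a typed link maps to a triple can be shown with a minimal Python sketch. It covers only the [[Property::Value]] form mentioned above; the function extract_triples and its regular expression are hypothetical and are not part of Semantic MediaWiki.

```python
import re

# A typed link has the form [[Property::Value]]; plain links and
# category links ([[Category:Country]]) contain no double colon.
ANNOTATION = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)\]\]")

def extract_triples(page_title, wiki_text):
    """Return (subject, property, value) triples for every typed link on a page."""
    return [(page_title, prop.strip(), value.strip())
            for prop, value in ANNOTATION.findall(wiki_text)]

wiki_text = "Its capital is [[Capital::Athens]]. [[Category:Country]]"
print(extract_triples("Greece", wiki_text))
```

The page title supplies the subject of each triple, so the same annotation text yields different triples on different pages.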

It was soon noted that these annotations make the wiki text harder to read and might confuse and discourage novice contributors. To solve this, editors often moved the annotations from the text to templates. For example, on the Greece page, the country template can be called using the parameter Capital: {{Country|Capital=Athens}}.

The template can now create a consistent layout for country pages, for example, by including an infobox displaying the capital. In Semantic MediaWiki, the template can also annotate the country article with the capital, thus moving the additional annotation syntax from the article to the template. This shifts the burden from the casual Wikipedia editor to people who are comfortable with editing the complex template syntax. This is not a technically enforced trade-off, however; the annotation can be in either the article or template. The community must choose how to enforce annotated links.

Instead of annotating links, the community might choose to put the annotations at the bottom of the page, similar to how category annotations are used today. So, articles would still include the same annotation, but with no link at that location. This decouples the annotations from the text (see Figure 2). Figure 2b shows the links annotated in the text, whereas Figure 2c shows all annotations at the end. The decoupling has a different trade-off: Having the links annotated in the text increases consistency (because changing the capital of Greece from Athens to Sparta would change both the text and the annotation), while having the annotated links at the end increases readability. Again, this is a social choice, and the technology allows for both options, simplifying editing or increasing consistency. But this might only be the microview of the issue; that is, simplified editing could in turn lead to increased community participation, which might eventually lead to more consistency due to an increased number of eyes. It is difficult to predict the sociotechnical impact of such a choice.

Greece is a [[Europe]]an country.

Its capital is [[Athens]].

[[Category:Country]]

(a)

Greece is a [[continent::Europe]]an country.

Its capital is [[capital::Athens]].

[[Category:Country]]

(b)

Greece is a [[Europe]]an country.

Its capital is [[Athens]].

[[Category:Country]]

[[Continent::Europe]]

[[Capital::Athens]]

(c)

Figure 2. Source for the Wikipedia page about Greece. (a) Plain MediaWiki, (b) Semantic MediaWiki, and (c) Semantic MediaWiki using category-style annotations.

The DBpedia project offers an alternative approach to extracting data from Wikipedia [7]. DBpedia extracts data from the existing templates without changing Wikipedia’s technical component at all. Instead, it creates a strong incentive for standardizing and controlling the use of templates with much more scrutiny. The success of DBpedia might lead to more pressure to change how templates are used to improve information extraction, thus pushing for more of a social change than the technical change that Semantic MediaWiki requires. This approach also involves bigger technical changes by implementing Semantic MediaWiki on the Wikipedia servers, but it would eventually lead to larger changes in how the community applies templates.
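The idea of reading data out of existing template calls, without any new wiki syntax, can be sketched as follows. This is not DBpedia’s extraction framework; the function template_facts and its regular expression are assumptions made for illustration, and the input is the template call from Figure 1a.

```python
import re

# A template call with at least one argument: {{Name | key=value | ...}}
TEMPLATE_CALL = re.compile(r"\{\{([^{}|]+)\|([^{}]*)\}\}")

def template_facts(page_title, wiki_text):
    """Turn named template parameters into (page, key, value) facts."""
    facts = []
    for _template_name, raw_args in TEMPLATE_CALL.findall(wiki_text):
        for arg in raw_args.split("|"):
            if "=" in arg:  # positional arguments carry no property name
                key, value = arg.split("=", 1)
                facts.append((page_title, key.strip(), value.strip()))
    return facts

page = "Greece is {{Country | Continent=Europe | Capital=Athens}}"
print(template_facts("Greece", page))
```

Because the parameter names become the property names, an extractor of this kind only yields comparable data if editors use the same template and the same parameter names across pages, which is the standardization pressure described above.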

We are convinced that we can learn from analyzing the introduction of new features in Wikipedia. Specifically, this preliminary survey offers the following lessons:

• The lack of some technical features can lead to the misuse of other existing features to simulate the desired functionality (as exemplified by the use of backlinks for categories).

• Common practices within a community might necessitate changes in the technical platform to meet technical constraints without changing the practice of the community too much (as seen for parser functions).

• Introducing a new feature might lead to a huge perceived change in the mission and scope of the whole system (a challenge posed by the flagged revisions).

• In a sociotechnical system, imperfections in the underlying technical platform can be compensated by the community (as with the unusual syntax of the cite extension).

• A priori design of new features must resolve the technical challenges while also considering possible social implications.

Based on these findings, we have used the Web science process model’s descriptive power for both the a posteriori analysis and a priori design of new Wikipedia features. Our current plan for the eventual implementation of Semantic MediaWiki in Wikipedia is to openly discuss the different alternatives with the community and let them decide on the actual implementation.

Acknowledgments

This work was partially funded by the European Union under EU IST FP7 project Active (www.active-project.eu).

References

1. P. Ayers, C. Matthews, and B. Yates, How Wikipedia Works, No Starch Press, 2008.

2. J. Hendler et al., “Web Science: An Interdisciplinary Approach to Understanding the Web,” Comm. ACM, vol. 51, no. 7, 2008, pp. 60–69.

3. T. Berners-Lee, “The Two Magics of Web Science,” keynote speech, 16th Int’l World Wide Web (WWW) Conf., 2007, www.w3.org/2007/Talks/0509-www-keynote-tbl.

4. J. Voss, “Collaborative Thesaurus Tagging the Wikipedia Way,” tech. report abs/cs/0604036, Computing Research Repository, 2006.

5. T. Starling, “Mathematical Expressions and Conditional Constructs, Now Implemented,” 2006; http://lists.wikimedia.org/pipermail/wikitech-l/2006-April/022341.html.

6. M. Krötzsch et al., “Semantic Wikipedia,” J. Web Semantics, vol. 5, 2007, pp. 251–261.

7. S. Auer et al., “DBpedia: A Nucleus for a Web of Open Data,” Proc. 6th Int’l Semantic Web Conf., Springer, 2007, pp. 722–735.

The Authors

Mathias Schindler is a project manager at Wikimedia Deutschland. His research interests include authority files and bibliometrics. Schindler studied at the University of Frankfurt. Contact him at [email protected].

Denny Vrandečić is a postdoctoral researcher at the Karlsruhe Institute of Technology, Germany. His research interests include massive collaboration and the Semantic Web. Vrandečić has a PhD from the Karlsruhe Institute of Technology. Contact him at [email protected].

