Assessing the impact of demographic characteristics on spatial error in volunteered geographic information features
William F. Mullen • Steven P. Jackson •
Arie Croitoru • Andrew Crooks •
Anthony Stefanidis • Peggy Agouris
© Springer Science+Business Media Dordrecht (outside the USA) 2014
Abstract The proliferation of volunteered geo-
graphic information (VGI), such as OpenStreetMap
(OSM) enabled by technological advancements, has
led to large volumes of user-generated geographical
content. While this data is becoming widely used, the
understanding of the quality characteristics of such
data is still largely unexplored. An open research
question is the relationship between demographic
indicators and VGI quality. While earlier studies have
suggested a potential relationship between VGI
quality and population density or socio-economic
characteristics of an area, such relationships have not
been rigorously explored, and mainly remained
qualitative in nature. This paper addresses this gap
by quantifying the relationship between demographic
properties of a given area and the quality of VGI
contributions. We study specifically the demographic
characteristics of the mapped area and its relation to
two dimensions of spatial data quality, namely
positional accuracy and completeness of the corre-
sponding VGI contributions with respect to OSM
using the Denver (Colorado, US) area as a case study.
We use non-spatial and spatial analysis techniques to
identify potential associations among demographics
data and the distribution of positional and complete-
ness errors found within VGI data. Generally, the
results of our study show a lack of statistically
significant support for the assumption that demo-
graphic properties affect the positional accuracy or
completeness of VGI. While this research is focused
on a specific area, our results showcase the complex
nature of the relationship between VGI quality and
demographics, and highlight the need for a better
understanding of it. By doing so, we add to the debate
on how demographics impact on the quality of VGI
data and lay the foundation for further work.
Keywords Volunteered geographic information · OpenStreetMap · Spatial analysis · Spatial data
quality · Demographics
W. F. Mullen (✉) · S. P. Jackson · A. Croitoru · A. Stefanidis · P. Agouris
Department of Geography and GeoInformation Science,
George Mason University, 4400 University Drive,
MS 6C3, Fairfax, VA 22030-4444, USA
e-mail: wmullen@gmu.edu
A. Croitoru
e-mail: acroitor@gmu.edu
A. Stefanidis
e-mail: astefani@gmu.edu
P. Agouris
e-mail: pagouris@gmu.edu
A. Crooks
Department of Computational Social Science, Krasnow
Institute for Advanced Study, George Mason University,
4400 University Drive, MS 6B2, Fairfax, VA 22030-4444,
USA
e-mail: acrooks2@gmu.edu
GeoJournal
DOI 10.1007/s10708-014-9564-8
Introduction
Recent years have been characterized by a significant
shift in the way geographic information is produced,
specifically through the rise of volunteered geographic
information (VGI; Goodchild 2007), which has been
fueled through Web 2.0 technologies (Hudson-Smith
et al. 2009). While this shift has resulted in an increase
in the volume and richness of geographic data, it is also
posing some significant challenges with respect to
evaluating the quality of such data. The fact that
volunteers with minimal (if any) geographic training
are contributing such information (Mooney et al. 2010),
as well as the mechanisms that govern the contribution
process, brings to question the quality of such data (Sui
2008). An exemplar of this trend is OpenStreetMap
(OSM), an open source collaborative mapping project
that aims to generate an editable global map database
(Mooney and Corcoran 2012). Earlier studies address-
ing the quality, in particular positional accuracy, of
VGI with respect to OSM have pointed out the lack of
homogeneity (e.g. Hochmair and Zielstra 2013; Kouk-
oletsos et al. 2012). In order to understand these issues,
researchers have explored the relationship of VGI data
and its quality with various geographical and demo-
graphic indicators. Studies addressing the relation
between population density and VGI have indicated
that in densely populated urban areas more contribu-
tions can be expected, which may ultimately lead to
higher quality (Haklay 2010). At the same time, Girres
and Touya (2010) noted that population density is not
the only factor controlling the positional accuracy of
VGI, indicating that areas with higher income and
younger population are characterized by higher number
of contributions. Further efforts to identify discernible
spatial patterns in VGI quality (e.g. by comparing the
topology of volunteered road network data and com-
mercially available data) did not find statistically
significant correlation (Neis et al. 2011).
A common assertion that has emerged from such
studies is that the demographic characteristics of the
contributing volunteers may impact the distribution of
the positional and shape errors in VGI (Fairbairn and
Al-Bakri 2013). Certain demographic indicators that
have been explicitly suggested as potentially contrib-
uting to data quality patterns, include race and
economic status (e.g. Tulloch 2008; Elwood 2008;
Graham 2005; Zook and Graham 2007a, b; Crutcher
and Zook 2009). However, these earlier studies were
primarily qualitative in nature and were not accom-
panied by rigorous quantitative analyses. Motivated
by this gap, this paper aims to examine the quantitative
relationships between VGI quality—in particular
positional accuracy and completeness—and demo-
graphic properties. We do so through a case study in
Denver, Colorado (CO), where we use VGI data and
contrast it with local demographics from the United
States (US) Census Bureau. The remainder of the paper is
organized as follows: in section “Background and
motivation” we provide background information with
respect to the demographic characteristics of VGI. In
section “Data and methods” we present the background
and rationale for our approach to assess the
impact of demographic variation on the quality of
VGI. In section “Results and discussion” we present
the results of our analysis, and conclude with our
summary and outlook in section “Discussion”.
Background and motivation
When attempting to assess the quality of VGI content
we need to be cognizant of the particular nature of the
volunteering process that differentiates VGI contribu-
tions from the traditional and established processes
through which geographical data had been collected.
While VGI was enabled by technological advance-
ments, it gained popularity primarily because it
addressed the general public’s growing need to access
geographical data for a constantly increasing array of
activities. This is the reason why OSM, the prototyp-
ical example of VGI, emerged and grew in the United
Kingdom, where geographical data were not as freely
distributed by government agencies (Haklay and
Weber 2008). In contrast, the US government policy
has leaned more towards openly sharing much of its
geographical data. Up to this point, efforts to assess the
quality of OSM content have focused primarily on road
networks, as OSM was intended after all to be a ‘street
map’. Such studies have compared for example OSM
road data to a reference data set, in order to assess OSM
relative to an authoritative standard such as the U.K.
Ordnance Survey road network data sets or comparable
products (Brown and Pullar 2012; Haklay 2010;
Hochmair and Zielstra 2013; Koukoletsos et al. 2012;
Neis and Zipf 2012; Neis et al. 2011). However, as
OSM has evolved well beyond streets, we now need
to assess the accuracy of other types of features as
well, in order to gain a more thorough understanding of
quality issues with respect to VGI. This need is further
emphasized by the fact that large volumes of road data
(e.g. TIGER/line files), which were made freely
available by the U.S. Census, have been bulk-uploaded
into OSM since 2009 (OSM 2013b). Accordingly, road
content in OSM (especially in the US) does not
comprise solely VGI contributions, but a hybrid
aggregate of VGI and authoritative data. Therefore, a
study of non-road features (especially ones that do not
include bulk uploads of authoritative content) will
complement the current body of work and enhance our
understanding of the accuracy of OSM contributions,
and this is one of the contributions of this paper.
Furthermore, there is a lack of understanding with
respect to the spatially-driven motivations for VGI
contributions: why do people contribute for some
areas and not others? For example, Zielstra and Zipf
(2010) contrasted the differences between VGI and
commercial data sources in Germany against popula-
tion density, noting that the completeness of the VGI
degraded considerably as the distance from the urban
core increased. Conversely, rural areas with low
population densities are less likely to attract VGI
contributions. However, how this participation
varies among locations with similar
population density remains largely unexplored. For
example, Goodchild and Li (2012) found that Linus’
Law is not as effective for geographic facts as it is for
other information like Wikipedia, and that population
density alone is not sufficient to explain the trends in
the data error. A study of the quality of contributions
as it relates to the demographics of the place will
therefore advance our understanding of the relation
between population characteristics and VGI quality,
and this is the second contribution of this paper. In the
following two subsections we provide a review of
accuracy and completeness in the context of OSM
(section “Positional accuracy and completeness in
OSM”), followed by a discussion of the characteristics
of the contributors to VGI, both in terms of their
motivation and their demographics (section “The
motivation to contribute to open source initiatives”).
Positional accuracy and completeness in OSM
In the context of this paper the term accuracy is used to
refer to positional accuracy, namely the closeness of
the coordinate values of a VGI feature (e.g. a point) to
its corresponding authoritative equivalent feature
based on Euclidean distance. The term completeness
refers to the extent to which features are included or
omitted from a dataset, again in comparison to the
authoritative equivalent. In that sense, if one were to
consider schools, accuracy would refer to how close a
VGI record of a school is to its corresponding
authoritative record, whereas completeness would
refer to the percentage of schools that have been
mapped in a VGI dataset. Both terms are indicators of
the overall quality of a dataset, a term which in this
paper is used to refer to the overall fitness for use of a
dataset.
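As an illustration of these two definitions, the following minimal sketch computes the positional error of matched point features and the completeness rate of a VGI dataset. The school identifiers and coordinates are hypothetical, coordinates are assumed to be in a projected system with units of metres, and the matching between VGI and reference records is assumed to be already known; it is not the matching procedure used in this study.

```python
import math

def positional_error(vgi_xy, ref_xy):
    """Euclidean distance between a VGI point and its reference point."""
    return math.hypot(vgi_xy[0] - ref_xy[0], vgi_xy[1] - ref_xy[1])

def completeness(n_matched, n_reference):
    """Percentage of reference features that have a VGI counterpart."""
    return 100.0 * n_matched / n_reference

# Hypothetical reference schools: id -> projected (x, y) in metres.
reference = {"A": (0.0, 0.0), "B": (100.0, 0.0),
             "C": (0.0, 200.0), "D": (50.0, 50.0)}
# Hypothetical VGI records keyed by the reference id they were matched to.
# School "D" has no VGI counterpart.
vgi = {"A": (3.0, 4.0), "B": (100.0, 12.0), "C": (6.0, 208.0)}

errors = {k: positional_error(vgi[k], reference[k])
          for k in vgi if k in reference}
mean_error = sum(errors.values()) / len(errors)
completeness_pct = completeness(len(errors), len(reference))

print(f"mean positional error: {mean_error:.1f} m")
print(f"completeness: {completeness_pct:.0f} %")
```

In this toy example three of the four reference schools are matched, so completeness is 75 %, and the positional error is averaged over the matched features only.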
Past efforts to assess the quality of VGI contri-
butions have primarily focused on positional accu-
racy. Haklay (2010) compared OSM data and the
United Kingdom’s (UK) Ordnance Survey road
centerline data, finding that OSM data road center-
lines are displaced on average by 5.83 m for selected
areas within London. Girres and Touya (2010)
compared French road data and reported an average
displacement of 6.65 m, which is consistent with
Haklay (2010). These studies, however, did not
examine any spatial variations of accuracy. In an
effort to address this issue, Al-Bakri and Fairbairn
(2010) performed a localized study in the UK,
assessing the accuracies of nodes and linear seg-
ments of polygonal VGI entries. They found that
node accuracy degraded between urban and the more
rural ‘peri-village’ areas, noting errors of 9.6 and
11.0 m respectively. For linear segments, the authors
measured an average displacement from the sur-
veyed results of 1.5 m using uniformity of buffers
established for both the reference and OSM data.
The spatial heterogeneity of VGI errors was also
pointed out by other researchers who noted that
inaccuracies were often localized to specific areas
(e.g. Girres and Touya 2010; Haklay 2010; Zielstra
and Zipf 2010). In an effort to understand these
accuracy variations, Haklay et al. (2010) showed that
there was no observable correlation between the
number of contributors and quality of VGI data once
that number reaches a certain level. Considering the
lack of discernible spatial patterns of accuracy
variations, the research community turned its atten-
tion to the motivation behind VGI contributions, as a
potential explanation for these variations.
The motivation to contribute to open source
initiatives
Open source initiatives exceed the purview of the
geographical community, with similar efforts seen in a
very broad range of activities, ranging, for example,
from Wikipedia and open source software develop-
ment projects to citizen science. Nevertheless, regard-
less of the topic, there exist certain general motivating
factors that drive participation in such efforts. For
example, Oreg and Nov (2008) identified the three
primary motivational elements for contribution,
ranked by priority of importance, as: self-develop-
ment, altruism, and reputation building. Kuznetzov
(2006) argued that for Wikipedia, perhaps the ‘poster
child’ for crowdsourcing efforts, the key motivation is
a sense of community and accomplishment. Vickery
and Wunsch-Vincent (2007) noted that the motiva-
tions driving these contributions of user-generated
content are primarily technological, social, economic,
and institutional or legal considerations. In essence, if
it’s not difficult, illegal, or inconvenient to contribute
the information and there is some motivation in the
form of intrinsic social or economic return achieved by
the contributor, then there will be information con-
tributed (Nov et al. 2011).
While the above drives public participation, the
very nature of public participation introduces biases in
volunteered content. These biases result in variations
in the patterns of contribution and the accuracy of the
contributed content itself. Such biases stem from four
key areas: Internet access, knowledge of language,
available time, and adequate technical capability to
support the editing functions required of the contrib-
utors (Holloway et al. 2007). As these areas are closely
associated with demographic properties, it is only
natural to move towards studying VGI content relative
to the corresponding demographic information of an
area, which is a direction that we pursue in this paper.
This is also consistent with earlier work by Porter and
Donthu (2006) and Longley and Singleton (2009),
who also identified the correlation between Internet
usage and several demographic indicators (e.g. age,
education, income and race).
When we consider the above under the lens of VGI
contributions in particular, an argument emerges about
the relation between VGI contributions and demo-
graphics. Elwood et al. (2013) note that differences in
social processes associated with VGI can impact the
content and quality of the contributed data. Studies of
OSM contributions have also noted that as the
population density decreases so do OSM contributions
(e.g. Zielstra and Zipf 2010; Girres and Touya 2010;
Haklay 2010). However,
population density itself is not the only factor affecting
content and quality (Girres and Touya 2010; Schmidt
and Klettner 2013) as it is not uncommon to have
information gaps in highly populated areas (Cipeluch
et al. 2010). Girres and Touya (2010) and Haklay
(2010) noted a significant decrease in contributions for
areas that were economically deprived or disadvan-
taged. This suggests that socioeconomic factors, such
as income and educational achievement may affect
OSM contributions, leading to complex spatial pat-
terns of participation (Elwood 2008, 2009; Ghose and
Elwood 2003; Sieber 2006; Tulloch 2008). Graham
(2005) along with Zook and Graham (2007a, b) noted
similar impacts as spatial queries and ‘software-
sorting’ techniques influenced by cultural differences
which can create or enhance a bias in digital presen-
tation or ‘perception’ of a place. Furthermore, Crut-
cher and Zook (2009), in a study of geographical data
in the context of reporting the impacts of the 2005
Hurricane Katrina in New Orleans, Louisiana, stated
that racial inequities were a key factor affecting access
and use of digital technologies. However, while there
have been a number of suggestions regarding the
impact of the demographic element on the contribu-
tion patterns and quality of VGI data, this direction
remains understudied, and in this paper we make a
contribution by addressing this issue.
Data and methods
The problem of exploring possible relationships
between population demographics and VGI quality
gives rise to significant challenges, both in terms of the
required data and in terms of the analysis methods that
are utilized. In terms of data, the study of demograph-
ics and VGI quality requires the availability of detailed
demographics data that can be spatially related to a
given study area. At the same time, it is also necessary
to derive VGI quality measures for the same study area.
To derive such measures both VGI and reference data
are required so that both positional accuracy and
completeness could be calculated. At the analysis
level, several key aspects should be considered. First,
the derivation of VGI quality measures requires a
reliable method for conflating (i.e. matching geo-
graphic features between) the VGI data and the
reference data. Second, as the possible relations
between demographics and VGI quality can be
considered as both non-spatial and spatial phenomena,
both non-spatial and spatial analysis methods should
be applied. When studied as a spatial phenomenon, the
possible relations between demographics and VGI
quality can emerge both at the local and the global
geographic scales, requiring both types of spatial
analysis. Finally, the high multidimensionality of
demographics data and its type heterogeneity (e.g.
data can include ranked-scale variables or interval-
scale variables) requires both the ability to reduce such
dimensionality and account for the various data scales.
In view of these challenges, it is necessary to employ
multiple data sources as well as an array of both non-
spatial and spatial analysis methods. An overview of
the data sources used in this study as well as the
workflow that combines the different analysis steps
and methods is shown in Fig. 1. In section “VGI
quality and demographics data” below we describe the
data sources (reference, VGI and demographics) used
in this workflow, and in section “Analysis methods”
we describe the methods used to explore possible
relations between demographic indicators and VGI
quality.
VGI quality and demographics data
In order to explore the relation between demographics
and VGI quality we build upon the previous work of
Jackson et al. (2013), who proposed a methodology for
quantifying the completeness and accuracy of point
datasets in VGI data. In particular, given a VGI data
set and a reference data set, the proposed methodology
assessed completeness through a multi-step matching
process that attempts to find for each feature in the
reference data set a matching feature in the VGI data
set. This results in matched features (for which a
match in the VGI dataset was found) and unmatched
features (for which a match in VGI dataset was not
found). The rates of matched and unmatched features in
the VGI dataset can then be transformed into
completeness rates based on the reference data set.
However, that paper did not explore the relation
between demographics and VGI quality. Consequently,
the goal of the research presented here is to
investigate this relationship. In order to accomplish
our objective we leverage the same datasets used in
Jackson et al. (2013) and augment them with demographic
information. The study area covers a large percentage
of the City and County of Denver, including downtown
Denver, and extends into portions of the surrounding
Arapahoe, Jefferson, and Adams Counties as shown in
Fig. 2. Specifically, the following datasets were used:
• Oak Ridge National Laboratory (ORNL) data,
  which geographically located a Department of
  Education list of schools using the street address
  information. The ORNL data locates a point for
  each school at the street address.
• School locations from the Points of Interest (POI)
  layer of OSM. The POI layer represents each
  specific feature as a node. In OSM, it is common
  practice to represent area features as points (Over
  et al. 2010) and the guidelines provided within OSM
  indicate that the node should be placed in the middle
  of the site (OSM 2013a).
• Data from the United States Geological Survey
  (USGS) OSM Collaborative Project (OSMCP,
  2nd Phase; Poore et al. 2012).
Fig. 1 The analysis workflow
Fig. 2 The study area
We consider the ORNL data as the authoritative
dataset in our analysis while the remaining two
datasets represent different types of VGI data. The
OSM POI school data represents ‘classic’ VGI while
the OSMCP dataset represents a variant of VGI data in
the sense that it introduces limited authoritative
oversight to the VGI process, in the form of
peer-developed quality control feedback to the volunteers
(e.g. university students) as well as USGS feedback
(Poore et al. 2012). This is part of a larger effort by the
USGS to augment the National Map (2014) with VGI
data pertaining to manmade structures (e.g. schools,
hospitals, post offices, police stations etc.). As one
might expect, the OSMCP effort would likely fall in
the Expert Professionals range of the scale proposed
by Coleman et al. (2009) since the final review of the
data is conducted by USGS personnel. This has also
been shown in Jackson et al. (2013), who indicated
that OSMCP data was of higher quality both in terms
of completeness and positional accuracy compared to
OSM data.
In order to assess the relationship between demo-
graphics and VGI quality the above-mentioned data-
sets were augmented with demographic data. Such
data can be drawn from census and provide us with the
respective characteristic properties (e.g. age, race,
ethnicity, economic status, or gender) of the popula-
tion under study (McKechnie 1983) at a specific point
in time. In the US, the US Census Bureau has the
responsibility to compile and publish the official
demographic data, including conducting the decennial
Census, at a variety of spatial scales (with Census
blocks being the finest). Within our study we use
Census tracts rather than Census blocks, as the latter
lack certain demographic information (e.g. ethnicity)
due to privacy concerns. For example, data published
at the Census block level could potentially lead to the
identification of individual households or businesses (US Census
Bureau 2012). At the Census tract level, demographic
information such as age, sex, education, employment,
ethnicity, immigration, income, marital status, population,
poverty level, and race is made available
through a spatial join operation. Consequently, census
tract data provides us with a rich source of georefer-
enced demographic data, which we can use for
assessing how demographics impact on the quality
of VGI data. As discussed above, prior research
suggests that four broad demographic categories (i.e.
population, economic status, educational achieve-
ment, and race/ethnicity) should be used to assess
the impact of demographics on VGI accuracy and
completeness. From among these categories we
selected a set of 18 demographic indicators to use in
our study as shown in Table 1.
In conjunction with these indicators, the complete-
ness and positional accuracy of the OSM and OSMCP
were evaluated. For this purpose, the method
described in Jackson et al. (2013) was used to derive
the positional accuracy and completeness of both the
OSM and OSMCP datasets with respect to the ORNL
dataset (which was considered as the authoritative
source). This analysis resulted in completeness rates
for each dataset (89 % for OSMCP vs. ORNL and
72 % for OSM vs. ORNL) and positional error
distributions (47 m ± 50 m for OSMCP vs. ORNL
and 190 m ± 314 m for OSM vs. ORNL). Furthermore,
59 % of the time OSMCP schools were closer to
their ORNL reference entries, compared to their OSM
counterparts, suggesting that OSMCP outperforms
OSM. These results, together with the demographics
indicators, were then used to analyze possible
relations between VGI quality and demographics.
Analysis methods
Our analysis is carried out in two modes: non-spatial
and spatial. While in the non-spatial analysis mode we
focus on the descriptive statistical characteristics of
the relationship between demographics and quality, in
the spatial analysis mode we focus on the geographic
properties of this relationship. Below, we provide a
concise description of each analysis mode (non-spatial
and spatial) with a particular emphasis on the spatial
analysis methods used.

Table 1 Census tract level demographic indicators considered
in our study across four categories: population, economic
status, education, and race/ethnicity
Population: Total population; Population density; Median age; Percent male; Percent female
Economic status: Median household income; Median home value; Percent below poverty line; Percent of homes receiving food stamps
Education: Percent without high school (HS) diploma; Percent with HS diploma (and over 25 years old); Percent with BA degree (or better) (and over 25 years old)
Race/ethnicity: Percent white; Percent white (not Hispanic); Percent African American; Percent American Indian; Percent Asian; Percent Hispanic
Non-spatial analysis
In the non-spatial analysis mode we explore the
relations between VGI quality and demographic
indicators in four steps. First, we compare the
distribution (i.e. histogram) of quality measured with
respect to the distribution of the demographic indica-
tors in order to determine whether accuracy errors or
completeness errors are generated in tracts that fit
specific demographic profiles. As some demographic
indicators are continuous variables, a method for
binning the data is required. In this study we
approximate the distribution of the demographic
indicators with histograms produced by a binning
process. A number of statistical approaches have been
developed to bin data, which generally fall into four
categories: natural (Jenks); quantile; equal-interval;
and standard deviation (Longley et al. 2010). While
each of these approaches could be considered, no
substantial differences in the results were observed
between them in our case study. Consequently, natural
(Jenks) breaks within each data element are used to
establish five ranges within each of the selected
demographic indicators to enable the exploration of
emergent patterns in the distribution of a VGI quality
measure across a demographic parameter (e.g. OSM
completeness vs. median age).
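The natural breaks binning can be sketched as follows. This is a minimal, unoptimized dynamic-programming version of one-dimensional natural breaks classification (it minimizes the total within-class sum of squared deviations), written only to illustrate the idea; the function names and the median-age values are hypothetical, and the study itself may have relied on a GIS package rather than code of this kind.

```python
def jenks_breaks(values, n_classes):
    """Upper bound of each class in the optimal natural-breaks partition."""
    data = sorted(values)
    n = len(data)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and len(values)")

    # Prefix sums give the squared deviation of any slice in O(1).
    s1, s2 = [0.0], [0.0]
    for v in data:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def ssd(i, j):
        """Sum of squared deviations from the mean of data[i:j]."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[k][j]: minimal cost of splitting data[:j] into k classes.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = cost[k - 1][i] + ssd(i, j)
                if c < cost[k][j]:
                    cost[k][j] = c
                    split[k][j] = i

    # Walk back through the stored split points to recover class bounds.
    bounds, j = [], n
    for k in range(n_classes, 0, -1):
        bounds.append(data[j - 1])
        j = split[k][j]
    return bounds[::-1]

def assign_bin(value, upper_bounds):
    """Index of the first class whose upper bound is >= value."""
    for idx, upper in enumerate(upper_bounds):
        if value <= upper:
            return idx
    return len(upper_bounds) - 1

# Hypothetical median ages for twelve tracts, binned into five ranges.
median_age = [22, 23, 24, 30, 31, 32, 40, 41, 55, 56, 70, 72]
breaks = jenks_breaks(median_age, 5)
print(breaks)  # [24, 32, 41, 56, 72]
```

Once each tract is assigned to a range with `assign_bin`, a quality measure such as completeness can be tabulated per range to look for emergent patterns.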
Following this initial analysis, Principal Component
Analysis (PCA; Press and Wilson 1978) and
discriminant analysis (Davis 1973) are applied in order
to identify the demographic indicators that would best
explain quality variations in the two VGI sets.
Through this analysis, PCA can potentially lead to a
reduction in the dimensionality of the demographic
indicators, while retaining as much as possible of the
variation present within the original dataset. This is
achieved by transforming the original dataset to a new
set of variables, the principal components (PCs),
which are uncorrelated but are ordered so that the first
few retain most of the variation present in all of the
original variables (Andrews et al. 1996). As PCA can
be sensitive to the scale of the variables, it may be
necessary to rescale the demographic indicators. Most
of these indicators are already captured as percentages,
so no such scaling is necessary for them. The
non-normalized indicators, such as total population,
population density, median age, and median household
income, must however be scaled to fit within the range
of 0–1 prior to the application of the PCA. The
transformation to compute these new scaled values is
straightforward: the minimum and maximum values of
an attribute are determined, and the minimum is
subtracted from the maximum to establish the range;
the minimum is then subtracted from the actual value
and the result is divided by the range, yielding the
scaled value for that attribute. PCA is particularly
useful as a starting point for additional detailed
analysis, such as linear regression, in that the PCA
results may identify multicollinearities within the data
that can then be used to reduce the data dimensionality
prior to additional analysis. By eliminating such
collinearities, linear regression can be simplified.
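The rescaling step described above amounts to min-max scaling, x' = (x - min) / (max - min). A minimal sketch follows; the function name and the income values are hypothetical, and the handling of a constant attribute (all values mapped to 0) is an assumption made here only to avoid division by zero.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumption: a constant attribute carries no variation; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

# Hypothetical median household incomes (US dollars) for four tracts.
median_income = [32000, 48000, 56000, 112000]
scaled_income = min_max_scale(median_income)
print(scaled_income)
```

After this step the non-normalized indicators occupy the same 0 to 1 range as the percentage-based ones, so no single indicator dominates the principal components purely because of its units.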
In order to explore and quantify possible correla-
tions between demographic indicators and VGI quality
measures both linear regression and discriminant
analysis are used. Specifically, linear regression is
used to assess the correlation between VGI positional
accuracy and demographic properties (Burt et al.
2009), and discriminant analysis (Davis 1973) is used
to assess the correlation between VGI completeness
and demographic properties. In the linear regression
analysis positional accuracy is used as the dependent
variable and the demographic properties as the
explanatory variables, and R² values are calculated
for both the OSM and OSMCP datasets. In the
discriminant analysis completeness is used as the
discriminant function, demographic properties are
used as the explanatory variables, and Wilks’ lambda
(Huberty 1984) is used to estimate the significance of
the discriminant function. Through the combination of
linear regression and discriminant analysis we are
therefore able to assess how both positional accuracy
and completeness relate to the various demographic
indicators in the study area.
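To make the two statistics concrete, the sketch below computes R² for an ordinary least squares fit and Wilks' lambda for two groups. It is a deliberately simplified illustration: it assumes a single explanatory variable in both cases (in which case Wilks' lambda reduces to the ratio of within-group to total sum of squares), whereas the analysis described here uses several demographic indicators at once. All names and data values are hypothetical.

```python
def linear_fit_r2(x, y):
    """Slope, intercept and R^2 of a least squares line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

def wilks_lambda(groups):
    """Univariate Wilks' lambda: within-group SS divided by total SS.

    Values near 1 mean the group means barely differ; values near 0
    mean the variable separates the groups well.
    """
    pooled = [v for g in groups for v in g]
    grand_mean = sum(pooled) / len(pooled)
    ss_total = sum((v - grand_mean) ** 2 for v in pooled)
    ss_within = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        ss_within += sum((v - group_mean) ** 2 for v in g)
    return ss_within / ss_total

# Hypothetical tracts: percent below poverty line vs. positional error (m).
pct_poverty = [5.0, 12.0, 18.0, 25.0, 31.0]
error_m = [120.0, 60.0, 210.0, 90.0, 150.0]
slope, intercept, r2 = linear_fit_r2(pct_poverty, error_m)

# Hypothetical median ages of tracts around matched and unmatched schools.
matched_age = [31.0, 35.0, 38.0, 42.0]
unmatched_age = [30.0, 36.0, 39.0, 41.0]
lam = wilks_lambda([matched_age, unmatched_age])

print(f"R^2 = {r2:.3f}, Wilks' lambda = {lam:.3f}")
```

A low R² together with a Wilks' lambda close to 1 would correspond to the situation reported in the abstract, where the demographic indicators offer little statistical support for explaining either quality measure.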
Spatial analysis
In the spatial analysis mode we analyze the spatial
properties of the completeness and positional accuracy
that were derived earlier. This analysis is carried out in
two steps. In the first step we explore whether spatial
patterns emerge in the VGI quality measures, and if so,
whether such patterns can also be found in the spatial
distribution of the studied demographic indicators. If
found, the existence of spatial patterns in both data sets
would provide supporting evidence to the possible
association between VGI quality and demographic
indicators. However, if patterns exist in one dataset
and not in the other then such association would be
unlikely. In this analysis, the term pattern refers to a
non-random spatial distribution of the studied vari-
ables that is statistically significant (based on a
relevant significance test). To accomplish this we
utilize Nearest Neighbor analysis (NN; Clark and
Evans 1954), Moran’s Index of Spatial Autocorrela-
tion (Moran’s I; Moran 1950), and Local Indicators of
Spatial Associations (LISA; Anselin 1995) to test for
statistically significant spatial patterns. In the second
step, we utilize regression to model any spatial
associations between the VGI quality measures and
demographic indicators. Specifically, we use exploratory
regression as well as Geographically Weighted
Regression (GWR; Fotheringham et al. 2002) to
determine whether a statistically significant model
(global or local) can be derived. We briefly describe
each of these analysis methods below.
NN analysis utilizes the average distance to the
nearest neighbor data point to determine whether a
point distribution is spatially random, clustered, or
dispersed compared to a random point pattern. Whether
a given dataset departs from a random distribution is
assessed with a z-score statistic: absolute values above
2.58 allow randomness to be rejected at the 99 %
confidence level, and absolute values above 1.96 at the
95 % confidence level. Using the NN analysis, both
completeness and positional accuracy can be analyzed
for the OSM and OSMCP data sets to determine whether
any spatial patterns can be detected
based on VGI completeness rates. While the NN
analysis addresses the question of whether global
spatial patterns exist in the quality of the two VGI
datasets studied here, it does not address the question of
whether such patterns exist at the local scale. To
address this, we utilize LISA in order to identify areas
where local clusters of higher or lower than expected
values exist (Anselin 1995). Once a dataset is shown to
exhibit spatial autocorrelation, LISA can identify
significant areas within the overall dataset that generate
such spatial autocorrelation. Accordingly, within our
analysis LISA is used to determine which, if any,
locations drive the spatial autocorrelation values that
are observed. Similarly to the NN analysis, the
positional accuracy and completeness measures from
both OSM and OSMCP will be used.
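The NN statistic described above can be illustrated with a minimal sketch. It assumes the standard Clark and Evans formulation without edge correction, a known study-area size, and planar coordinates; the function name and the sample coordinates are hypothetical and are not the study data or the implementation used in this paper.

```python
import math

def nearest_neighbor_z(points, area):
    """Average nearest neighbor ratio and z-score for 2-D points.

    points: list of (x, y) tuples; area: study-area size in squared units.
    A ratio below 1 suggests clustering, a ratio above 1 dispersion.
    No edge correction is applied (a simplifying assumption).
    """
    n = len(points)
    nearest = []
    for i, (xi, yi) in enumerate(points):
        # Brute-force distance from point i to its closest other point.
        nearest.append(min(math.hypot(xi - xj, yi - yj)
                           for j, (xj, yj) in enumerate(points) if j != i))
    observed = sum(nearest) / n
    # Expected mean distance and standard error under a random pattern.
    density = n / area
    expected = 0.5 / math.sqrt(density)
    std_err = 0.26136 / math.sqrt(n * density)
    return observed / expected, (observed - expected) / std_err

# Made-up coordinates for illustration only.
grid = [(x, y) for x in range(5) for y in range(5)]                # evenly spaced
clump = [(0.1 * x, 0.1 * y) for x in range(5) for y in range(5)]   # tightly packed
ratio_grid, z_grid = nearest_neighbor_z(grid, area=25.0)
ratio_clump, z_clump = nearest_neighbor_z(clump, area=25.0)
```

In this toy example the evenly spaced points yield a large positive z-score (dispersion) and the tightly packed points a large negative one (clustering).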
In conjunction with the NN and LISA analyses for
the VGI quality measures, similar analysis is carried
out for the demographic indicators in order to explore
whether any spatial patterns emerge. Specifically, we
are interested in determining whether demographic
indicators exhibit any significant spatial
autocorrelation (Burt et al. 2009) in the study area. As the
demographic indicators are associated with areal
features (i.e. tracts) we utilize Moran’s I to evaluate
the global spatial autocorrelation between tracts.
Similarly to the NN analysis, Moran’s I analysis
determines whether a set of features and their attribute
values are spatially randomly distributed, clustered, or
dispersed. In addition to considering the attribute
values of features to calculate autocorrelation,
Moran’s I takes into account the spatial relationships between
features in the form of a spatial weights matrix, which
can express both topological and metric spatial
relations (Wong and Lee 2005). The statistical
significance of Moran’s I is evaluated using a z-score test.
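As a minimal sketch of the global statistic described above, the following computes Moran's I and a z-score from attribute values and a spatial weights matrix. It assumes the variance under the normality assumption and a simple binary contiguity matrix for tracts arranged in a line; the function names and sample values are hypothetical and do not reproduce the toolbox implementation used in this study.

```python
import math

def morans_i(values, weights):
    """Global Moran's I and its z-score (normality assumption).

    values: n attribute values, one per areal unit.
    weights: n x n spatial weights matrix (list of lists), zero diagonal.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    i_stat = (n / s0) * cross / sum(d * d for d in dev)
    # Moments of I under the null hypothesis of no spatial autocorrelation.
    expected = -1.0 / (n - 1)
    s1 = 0.5 * sum((weights[i][j] + weights[j][i]) ** 2
                   for i in range(n) for j in range(n))
    s2 = sum((sum(weights[i]) + sum(weights[j][i] for j in range(n))) ** 2
             for i in range(n))
    variance = ((n * n * s1 - n * s2 + 3 * s0 * s0)
                / ((n * n - 1) * s0 * s0)) - expected ** 2
    return i_stat, (i_stat - expected) / math.sqrt(variance)

def chain_weights(n):
    """Binary contiguity weights for n units arranged in a line."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
            for i in range(n)]

# Made-up values for illustration only.
w = chain_weights(8)
i_trend, z_trend = morans_i([1, 2, 3, 4, 5, 6, 7, 8], w)  # similar neighbors
i_alt, z_alt = morans_i([1, 8, 1, 8, 1, 8, 1, 8], w)      # dissimilar neighbors
```

Neighboring units with similar values give a positive index, whereas alternating high and low values give a negative one.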
After exploring the existence of spatial patterns in
VGI quality measures and demographic indicators in
the study area, our analysis turns to explore whether
VGI quality could be modeled, either at the global or
the local scale. At the global scale we utilize
exploratory regression (de Smith et al. 2007;
Braun and Oswald 2011), and at the local scale GWR.
Exploratory regression was carried out in order to identify
an appropriate regression model, using the
distribution of the regression errors to assess the
overall regression quality. Building on Ordinary Least
Squares (OLS), exploratory regression considers
non-spatial regression quality indicators (e.g. R² and
the distribution of the residuals) as well as the spatial
autocorrelation within the residuals in order to evaluate the
quality of a given regression model. The premise
behind this approach is that a regression model should
exhibit a good fit to the explanatory variables,
normally distributed residuals (evaluated using the
Jarque–Bera p value), and a lack of spatial
autocorrelation between residuals (evaluated using a p-value
condition on the global Moran’s I test). In addition to the
R² and p-value criteria, further threshold
conditions on the quality of the regression can be
applied, for example a minimum number of model
variables and a maximum Variance Inflation Factor
(VIF; Wheeler and Tiefelsdorf 2005). The exploratory
nature of this approach stems from its iterative
implementation, in which different combinations of
the explanatory variables are tested until the best
model is found. This analysis was implemented for
OSM and OSMCP quality measures using the
modeling thresholds in Table 2.
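The iterative screening described above can be sketched as follows, using the threshold values listed in Table 2. This is an illustration under stated assumptions rather than the implementation used in this study: the model-fitting step is left as a hypothetical callable `fit` that returns diagnostics for a set of variables, the "maximum p value" criterion is assumed to apply to the coefficient p-values, and the diagnostics in the demonstration are made up.

```python
from itertools import combinations

# Threshold values as listed in Table 2.
THRESHOLDS = {
    "min_adj_r2": 0.50,
    "max_coef_p": 0.05,
    "max_vif": 7.50,
    "min_jarque_bera_p": 0.10,
    "min_spatial_autocorr_p": 0.10,
}

def passes(diag, t=THRESHOLDS):
    """True if a fitted model's diagnostics meet every criterion."""
    return (diag["adj_r2"] > t["min_adj_r2"]
            and max(diag["coef_p_values"]) < t["max_coef_p"]
            and max(diag["vifs"]) < t["max_vif"]
            and diag["jarque_bera_p"] > t["min_jarque_bera_p"]
            and diag["spatial_autocorr_p"] > t["min_spatial_autocorr_p"])

def explore(variables, fit, max_size=3):
    """Try every combination of up to max_size variables; rank passing models."""
    passing = []
    for k in range(1, max_size + 1):
        for subset in combinations(variables, k):
            diag = fit(subset)  # hypothetical callable returning diagnostics
            if passes(diag):
                passing.append((subset, diag["adj_r2"]))
    return sorted(passing, key=lambda item: item[1], reverse=True)

def fake_fit(subset):
    """Made-up diagnostics for illustration only; not results from the study."""
    good = set(subset) == {"var_a", "var_b"}
    return {"adj_r2": 0.62 if good else 0.08,
            "coef_p_values": [0.01] * len(subset),
            "vifs": [1.2] * len(subset),
            "jarque_bera_p": 0.40,
            "spatial_autocorr_p": 0.30}

ranked = explore(["var_a", "var_b", "var_c"], fake_fit)
```

An empty result from such a screening corresponds to the case in which no combination of explanatory variables satisfies all criteria.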
At the local scale, GWR is utilized to study spatially
localized regression models of the relation between
VGI quality measures and demographic indicators.
While based on principles similar to the global
exploratory regression, GWR explores local regres-
sion models by allowing the regression coefficients to
vary based on location in the study area. This is
accomplished by a location-dependent weight matrix
that expresses the spatial relations between data
elements. As a result, we are able to explore whether
there are any significant local associations between
VGI quality and demographic indicators.
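The idea of location-dependent coefficients can be illustrated with a minimal sketch. It assumes a single explanatory variable, a Gaussian distance kernel, and a fixed bandwidth, none of which is stated in this paper; the function name and data are hypothetical and made up for illustration.

```python
import math

def gwr_local_fit(coords, x, y, bandwidth):
    """Fit y = a + b*x at each location using Gaussian distance weights.

    Returns one (intercept, slope) pair per location.
    """
    results = []
    for (ux, uy) in coords:
        # Observations close to the regression point receive more weight.
        w = [math.exp(-0.5 * (math.hypot(ux - cx, uy - cy) / bandwidth) ** 2)
             for (cx, cy) in coords]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx
        results.append((my - slope * mx, slope))
    return results

# Made-up data: a positive relation in the west, a negative one in the east.
coords = [(float(i), 0.0) for i in range(5)] + [(100.0 + i, 0.0) for i in range(5)]
x = [1.0, 2.0, 3.0, 4.0, 5.0] * 2
y = [2.0 * v for v in x[:5]] + [-2.0 * v for v in x[5:]]
local = gwr_local_fit(coords, x, y, bandwidth=5.0)
```

In this toy example a single global fit would return a slope of zero, while the local fits recover slopes of roughly +2 and -2 in the two halves of the area, which is the kind of local association GWR is intended to detect.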
Results and discussion
Following the workflow and analysis methods
described in section ‘‘Data and methods’’, in this
section we outline and discuss the results that were
obtained through each analysis method. In particular,
the results of the non-spatial analysis mode are
described in section ‘‘Non-spatial analysis results’’,
and the results of the spatial analysis mode are described
in section ‘‘Spatial analysis results’’. The different
analyses were carried out using the SPSS¹ statistical
analysis package (for non-spatial analysis) and the
ArcGIS 10.1 Spatial Statistics Toolbox² (for spatial analysis).
Non-spatial analysis results
Histogram analysis
The statistical distribution of the VGI quality mea-
sures can be analyzed through the study of histograms
in an effort to determine whether positional accuracy
or completeness values arise in tracts that are
characterized by a specific demographic profile. Figure 3
illustrates this process using a sample histogram of the
percent white (non-Hispanic) demographic indicator
versus completeness rates (i.e. percent of matched
features) for both the OSM and OSMCP data sets. This
histogram, in which natural breaks bins were used,
provides a qualitative yet fast way to gain some initial
insights with respect to any noteworthy trends in the
relation between a given quality measure and a
demographic indicator. Additionally, differences
between the OSM and OSMCP data sets with respect
to a given demographic indicator can be observed
using such a histogram: when the difference between
bars of the same bin is small, then the two datasets
exhibit a similar behavior. Conversely, large differ-
ences between bars of the same bin imply that the
demographic characteristic being examined may
cause a difference between the results in the compar-
ison. A pattern can then emerge if the overall behavior
of one histogram across the bins is different from the
behavior of the second histogram across the same bins.
As can be seen from the example in Fig. 3, while there
are some differences between the two histogram bars,
it is difficult to identify a clear pattern.
In our analysis, the process of creating the histo-
gram was repeated for each demographic indicator in
an attempt to identify any visible trends or patterns. A
summary of the results of this process is provided in
Table 3 for completeness rates in both the OSM and
OSMCP data sets. An inspection of these results
shows that, generally, none of the demographic
indicators considered exhibit a clear trend or pattern.
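The per-bin completeness rates summarized in such histograms can be computed as in the following minimal sketch. It assumes that the bin boundaries (for example natural breaks) have already been derived elsewhere; the function name, boundaries, and sample values are hypothetical and made up for illustration.

```python
from bisect import bisect_right

def completeness_by_bin(indicator, matched, breaks):
    """Percent of matched features per bin of a demographic indicator.

    indicator: indicator value of the tract containing each feature.
    matched: True/False per feature (was a VGI match found?).
    breaks: ascending inner bin boundaries; len(breaks) + 1 bins result.
    Empty bins are reported as None (the NV entries of Table 3).
    """
    counts = [[0, 0] for _ in range(len(breaks) + 1)]  # [matched, total]
    for value, is_matched in zip(indicator, matched):
        b = bisect_right(breaks, value)
        counts[b][0] += bool(is_matched)
        counts[b][1] += 1
    return [round(100.0 * m / t, 1) if t else None for m, t in counts]

# Made-up values for illustration only.
indicator = [10, 20, 35, 40, 55, 90]
matched = [True, False, True, True, True, False]
rates = completeness_by_bin(indicator, matched, breaks=[25, 50, 75, 95])
# rates == [50.0, 100.0, 100.0, 0.0, None]
```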
Principal component analysis (PCA)
As described earlier, PCA is utilized in this study as a
tool for reducing the dimensionality of the data to
potentially simplify the linear regression analysis.
Accordingly, PCA was used to analyze the relation-
ship between the eighteen demographic indicators
presented in Table 1 to determine whether all of them
Table 2 Summary of the exploratory regression criteria values
Criteria                                      Threshold
Minimum adjusted R-squared                    > 0.50
Maximum p value                               < 0.05
Maximum VIF value                             < 7.50
Minimum Jarque–Bera p value                   > 0.10
Minimum spatial autocorrelation p value       > 0.10
¹ http://bit.ly/1pFDgnd
² http://bit.ly/1oTz1EL
Fig. 3 Analyzing the histogram of percent white (non-Hispanic) versus completeness rates (percent of matched features from the
ORNL data set) in the OSM and OSMCP data sets
Table 3 Summary of the histogram comparisons of completeness in the OSM and OSMCP data sets (NV indicates no value)
Demographic indicator                     OSMCP completeness rates        OSM completeness rates
                                         Bin 1 Bin 2 Bin 3 Bin 4 Bin 5   Bin 1 Bin 2 Bin 3 Bin 4 Bin 5
Total population                          80.0  91.8  87.6  90.6  87.5    55.0  74.5  67.4  76.4  72.9
Population density                        91.1  90.7  92.2  74.4  50.0    65.6  74.8  73.0  71.8  50.0
Median age                                  NV  81.0  92.0  91.6  92.3      NV  72.4  70.3  69.7  79.5
Percent male                                NV  92.1  88.0  85.7    NV      NV  65.9  74.7  71.4    NV
Percent female                              NV  85.0  87.7  90.8  84.2      NV  80.0  73.5  71.0  52.6
Median household income                   80.5  92.2  89.1  92.5  91.7    74.4  69.9  68.3  75.0  91.7
Median home value                          100  87.2  87.4  94.7  91.7    83.3  75.9  69.5  66.7  66.7
Percent below poverty line                90.8  92.0  90.4  81.6  75.0    72.4  70.4  71.3  72.4  75.0
Percent of homes receiving food stamps    91.1  88.6  90.4  81.0   100    70.4  70.7  74.7  71.4   100
Percent without HS diploma                89.9  90.3  88.2  87.2  88.5    66.4  73.1  71.8  69.2  96.2
Percent with HS diploma                     NV  87.5  87.6  90.7  89.9      NV  77.3  71.1  73.2  66.4
Percent with BA degree                    86.0  89.5  85.7  94.9  91.3    74.0  71.4  73.0  61.0  78.3
Percent white                              100  80.0  91.9  89.0  89.8    77.8  75.6  83.8  67.1  67.7
Percent white (not Hispanic)              90.0  81.5  90.9  89.2  91.7    90.0  81.5  90.9  89.2  91.7
Percent African American                  89.6  86.9  89.3  87.5  92.3    73.8  65.5  67.9  68.8  76.9
Percent American Indian                   92.5  89.0  86.6  87.4  97.0    73.6  72.0  62.5  76.7  81.8
Percent Asian                             92.1  90.4  81.6  87.0 100.0    69.5  77.7  66.7  68.5  93.3
Percent Hispanic                          91.9  86.7  92.8  84.5  86.4    74.8  63.8  67.5  75.9  84.1
provide unique information or whether they were
mostly redundant. The 401 features (schools) in our
data set, i.e. both features for which a match was found
and for which a match was not found, were assessed.
The PCA identified five factors that together describe
approximately 75 % of the variance in the eighteen
input properties. However, the analysis
did not find any of the 18 demographic indicators to be
redundant, and therefore none could be excluded.
Following this, the rotated component matrix was
computed using Varimax and the members of each
component were evaluated in an effort to understand the
particular relationships between the demographic prop-
erties. In this analysis the first factor included eleven of
the eighteen demographic properties from across the
four demographic categories identified in Table 1. The
other four factors identified included fewer demo-
graphic properties; however, the groupings of the
components for each of the factors were composed of
a mixture of demographic indicator categories pre-
sented in Table 1. Consequently, the PCA did not
indicate specific demographic categories (i.e. Popula-
tion, Economic Status, Education, Race/Ethnicity) that
drive the variability in VGI completeness.
Linear regression
Linear regression analyses were conducted on the
quality measures of both OSM and OSMCP in order to
assess whether they can be related to the demographic
properties associated with their location. In the
regression model positional accuracy was used as the
dependent variable and the demographic properties as
the explanatory variables on the 288 matched features
from the ORNL-OSM feature matching process and
the 357 matched features from the ORNL-OSMCP
feature matching process (section ‘‘VGI quality and
demographics data’’). Based on the results of the
regression standardized residuals were computed and
were checked for a possible linear relationship
between positional accuracy and the model residuals.
The results of this process are shown in Fig. 4 for both
the OSM and the OSMCP data sets. As can be seen,
both datasets show high correlations (0.92 and 0.95,
respectively), indicating a violation of the
homoscedasticity assumption. It should be noted that while three
observations in the OSM and OSMCP data sets were initially
suspected as outliers, the strong linearity of the
standardized residuals, particularly in the case of
the OSMCP dataset, suggests that the regression
model would not be valid even if these suspected
observations were removed.
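The residual check described above can be sketched as follows, assuming for brevity a single explanatory variable and residuals standardized by one common residual standard deviation; the function names and data are hypothetical and made up. For a least squares fit with an intercept, the correlation between the dependent variable and residuals scaled in this way equals the square root of 1 − R², so a strong linear pattern of this kind goes along with a model that explains little of the variation.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / math.sqrt(sum((a - mu) ** 2 for a in u)
                           * sum((b - mv) ** 2 for b in v))

def residual_check(x, y):
    """Fit y = a + b*x and correlate y with the standardized residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Simple standardization by the residual standard deviation
    # (no leverage adjustment; an assumption of this sketch).
    sd = math.sqrt(sum(e * e for e in resid) / (n - 2))
    std_resid = [e / sd for e in resid]
    r2 = pearson(x, y) ** 2
    return std_resid, pearson(y, std_resid), r2

# Made-up data with a weak linear relation, for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 1, 7, 2, 9, 3, 8, 4]
std_resid, corr_y_resid, r2 = residual_check(x, y)
```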
Given these results, further investigation of the
demographics indicators was carried out in order to
determine whether any multicollinearities exist
between the demographic indicators in Table 1, as such
multicollinearities may result in the standardized
residuals behavior that was observed in Fig. 4. This
analysis identified the following demographic
indicators as multicollinear:
percent female, percent male, percent with and
without HS diploma, percent white, and percent
African American. These indicators were then
removed and the linear regression correlation matrix
results for the OSM and OSMCP datasets were
recalculated. However, no significant correlation was
observed between the positional accuracy and any of
the remaining demographic indicators, suggesting
that these indicators do not explain the positional
accuracy error behavior in our datasets. It is worth
noting that a low correlation was observed between
positional accuracy and the population of Asians
across the study area.
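A minimal sketch of a multicollinearity check is given below. It assumes the simplest case of two explanatory variables, for which the Variance Inflation Factor reduces to 1 / (1 − r²) with r the pairwise correlation, and it compares the result with the VIF threshold of 7.50 listed in Table 2. The function names and values are hypothetical and made up; the percent female values are constructed as the complement of percent male, which is presumably why this pair is flagged.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / math.sqrt(sum((a - mu) ** 2 for a in u)
                           * sum((b - mv) ** 2 for b in v))

def vif_two_predictors(x1, x2):
    """VIF shared by two predictors when they are the only ones in a model."""
    r2 = pearson(x1, x2) ** 2
    # Guard against rounding when the predictors are (almost) perfectly related.
    return math.inf if 1.0 - r2 < 1e-12 else 1.0 / (1.0 - r2)

# Made-up tract values for illustration only.
percent_male = [48.0, 49.5, 50.2, 51.0, 52.3]
percent_female = [100.0 - m for m in percent_male]
median_age = [30.0, 42.0, 35.0, 38.0, 33.0]

vif_sex = vif_two_predictors(percent_male, percent_female)
vif_age = vif_two_predictors(percent_male, median_age)
```

In this toy example the complementary pair far exceeds the threshold, while the unrelated pair stays close to 1.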
Fig. 4 A plot of positional accuracy versus standardized
residuals for the OSM and OSMCP data sets
Discriminant analysis
Discriminant analysis was used to assess the relation-
ship between completeness and demographics in the
OSM and OSMCP data sets. This was carried out by
analyzing the differences in the demographic indica-
tors for those features that were matched against the
unmatched features for the two datasets. This analysis
yields an interesting result, in that within the same
dataset, there is a statistically significant difference
between those features that were successfully matched
within the study area and those features that were
unmatched. The results were consistent for both the
OSM and the OSMCP datasets indicating that there is
a statistical difference between the demographic
properties of the matched features and those of the
unmatched features.
Spatial analysis results
Nearest neighbor (NN) analysis
Nearest neighbor analysis was performed in order to
explore whether a spatial pattern (clustering or
dispersion) in positional accuracy and completeness
could be identified for both the OSM and OSMCP data
sets. As described in section ‘‘Spatial analysis’’, the
premise behind this approach is that if such patterns
are detected, then they can be evaluated against spatial
patterns of different demographic indicators to iden-
tify any similar spatial patterns. If similar patterns are
found with respect to a specific demographic indicator,
then that indicator can serve as a potential driver of
VGI quality. Such analysis can also lead to possible
insights on how demographics may affect different
types of quality measures. For example, in the case of
completeness, if there is a statistically significant
difference in the spatial pattern of demographic
indicators associated with matched features compared
to the spatial pattern of the same demographic
indicators associated with unmatched features, then
those demographic properties could potentially drive
VGI completeness rates. Testing whether the VGI
quality measures exhibit any patterns is therefore the
first step in developing an understanding of possible
relationships between VGI quality and demographics.
Table 4 summarizes the results of the Nearest
Neighbor analysis. In this analysis, features that could
be matched in the OSM and OSMCP data sets to the
ORNL data set were analyzed in order to determine if
they are clustered, dispersed, or randomly distributed.
In addition, a NN analysis was applied to the ORNL
data set in order to determine if the reference features
themselves exhibit a spatial pattern. For each of these
analyses a z-score was computed in order to establish
statistical significance. Absolute z-score values above 2.58
indicate a 99 % confidence level that the data is not
randomly distributed, while absolute values above 2.33 indicate
a 98 % confidence level and absolute values above 1.96
indicate a 95 % confidence level. When the spatial
distribution of all features (schools) is
analyzed, a statistically significant pattern emerges
even at the 99 % confidence level. This is expected, as
schools are generally not randomly distributed across
a given region. For the matched VGI features, neither the
OSM nor the OSMCP features exhibited a
statistically significant pattern at the 99 % confidence
level. When the level is reduced to 98 %, the matched features in
the OSM data set do exhibit a statistically significant
pattern, while further reduction to 95 % leads to a
statistically significant pattern in both VGI data sets.
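The reading of the z-scores reported in Table 4 against the three critical values can be reproduced with the short sketch below. The function name is hypothetical; the z-scores are those of Table 4, and the comparison uses their absolute values since all three are negative.

```python
def highest_confidence(z):
    """Highest of the three confidence levels at which |z| rejects randomness."""
    for critical, level in ((2.58, 99), (2.33, 98), (1.96, 95)):
        if abs(z) > critical:
            return level
    return None

# Z-scores as reported in Table 4.
table4 = {"OSM matched": -2.54, "OSMCP matched": -1.99, "ORNL all schools": -3.72}
levels = {name: highest_confidence(z) for name, z in table4.items()}
# levels == {"OSM matched": 98, "OSMCP matched": 95, "ORNL all schools": 99}
```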
Moran’s I and local indicator of spatial association
(LISA)
Following the NN analysis, spatial autocorrelation
analysis was carried out at the global and local scales.
As described in section ‘‘Spatial analysis’’, Moran’s I
is used in this study to determine whether the
demographics indicators and the positional accuracy
measures in the two VGI data sets are spatially
autocorrelated at the global scale. Table 5 summarizes
the results of the Moran’s I analyses of the demo-
graphic indicators. As can be seen, the z-scores of two
of the demographic properties, namely total popula-
tion, and percent with HS diploma, failed to reach the
99 % confidence level, while the remaining indicators
did show a statistically significant spatial autocorre-
lation. The fact that these two demographic properties
Table 4 Z-scores of matched features calculated from a NN analysis
Data set    Data                   Z-score
OSM         Matched features       -2.54
OSMCP       Matched features       -1.99
ORNL        All school features    -3.72
are most likely random greatly reduces their utility as
properties to be used for pattern analysis. Since the
accuracy of the OSM data was shown to have a non-
random distribution in the nearest neighbor analysis
described above, the spatial autocorrelation of the
positional accuracy distance of both OSM and
OSMCP was also computed. Table 6 provides the
results of these analyses. The low z-scores for both
datasets indicate that the accuracy is randomly
distributed throughout the study area. In summary,
these analyses show that sixteen of the demographic
properties are non-randomly distributed while posi-
tional accuracy on the OSM and OSMCP data sets is
randomly distributed. This challenges the premise that
spatial patterns in demographic indicators could
provide a possible explanation to the spatial distribu-
tion of positional accuracy and completeness in our
study area.
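To make the z-scores in Tables 5 and 6 easier to compare, they can be converted to two-sided p-values, as in the following sketch. It assumes a standard normal reference distribution; the function name is hypothetical, and the z-scores are those reported in the two tables.

```python
import math

def two_sided_p(z):
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Z-scores as reported in Table 6 and the two non-significant rows of Table 5.
p_osm = two_sided_p(0.23)           # OSM positional accuracy
p_osmcp = two_sided_p(0.32)         # OSMCP positional accuracy
p_total_pop = two_sided_p(2.462)    # total population
p_hs_diploma = two_sided_p(1.779)   # percent with HS diploma
```

The two positional accuracy p-values are far above any conventional significance level, whereas the two demographic indicators fall short of the 99 % level only.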
Following the global autocorrelation analysis,
LISA was applied to identify possible local autocor-
relation relationships in the positional accuracy mea-
sures of OSM and OSMCP. Once a dataset is shown to
exhibit spatial autocorrelation, LISA can identify
areas within the overall dataset that generate the
spatial autocorrelation. Within this analysis, LISA was
used to determine which, if any, features in the OSM
and OSMCP data drive the spatial autocorrelation
values that are observed. Although both the OSM and
OSMCP datasets yielded low z-scores indicating no
significant spatial autocorrelation at the global scale,
LISA was applied on the OSM and OSMCP data sets
to explore whether evidence of spatial autocorrelation
could be found at the local scale. Figure 5 shows the
LISA analysis results overlaid on a population density
map of the study area. The darker red circles are
classified as High–High Moran’s I values, and the dark
blue circles are classified as Low–Low Moran’s I
values. The lighter red and blue circles are locations
where the values were mixed (high and low), indicating
the presence of an outlier. The white circles, which
include 356 of the 401 records (89 %), are locations
where LISA did not identify a local spatial autocor-
relation pattern. The LISA analysis results show for
the OSM dataset a Low–Low cluster located towards
the southeast of the study area, although this does not
appear to correlate with the population density values.
The OSMCP data had a Low–Low cluster to the north
of the center of the study area and one further north
than that. However, in each case, the insignificant
records greatly outnumber the others and are inter-
spersed within them, suggesting that overall local
clustering occurs very sporadically in the study area.
This observation is in contrast to the spatial
autocorrelation that was observed for the demographic
indicators (as shown in Table 5), suggesting that the
studied demographic indicators do not support the
behavior of the VGI quality measures in the study
area. The LISA analysis results therefore indicate that
while no global spatial autocorrelation appeared to
exist in VGI quality, some local clusters do emerge
within the study area (Fig. 5).
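The High–High, Low–Low, and mixed classes referred to above can be illustrated with a minimal sketch of the local Moran statistic. It assumes row-standardized weights for units arranged in a line and omits the permutation test that would normally establish significance; the function names and values are hypothetical and made up for illustration.

```python
def local_morans(values, weights):
    """Local Moran's I and cluster/outlier label for each observation.

    weights: n x n row-standardized spatial weights, zero diagonal.
    Significance testing (e.g. by permutation) is not included.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    m2 = sum(d * d for d in dev) / n
    out = []
    for i in range(n):
        # Spatial lag: weighted mean deviation of the neighbors of i.
        lag = sum(weights[i][j] * dev[j] for j in range(n))
        label = (("High" if dev[i] > 0 else "Low") + "-"
                 + ("High" if lag > 0 else "Low"))
        out.append((dev[i] * lag / m2, label))
    return out

def chain_weights(n):
    """Row-standardized contiguity weights for n units arranged in a line."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]
        for j in nbrs:
            w[i][j] = 1.0 / len(nbrs)
    return w

# Made-up values: three high values followed by three low values.
result = local_morans([9, 8, 9, 1, 2, 1], chain_weights(6))
labels = [label for _, label in result]
```

In this toy example the two ends form High-High and Low-Low groups, while the two units at the boundary are labeled as mixed (High-Low and Low-High), corresponding to the outlier classes.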
Global exploratory regression
Exploratory regression was used to assess at a global
study area scale both completeness and positional
Table 5 Scores of spatial autocorrelation in the demographic indicators (non-significant results at a 99 % confidence level are marked with an asterisk)
Demographic property                      Z-score
Total population                            2.462 *
Population density                         23.827
Median age                                 10.898
Percent male                                4.509
Percent female                              3.414
Median income                               5.918
Median home value                          16.494
Percent below poverty line                 16.033
Percent of homes receiving food stamps     16.398
Percent without HS diploma                 21.322
Percent with HS diploma                     1.779 *
Percent with BA degree                     23.360
Percent white                              11.517
Percent white (not Hispanic)               22.163
Percent African American                   26.173
Percent American Indian                    25.294
Percent Asian                              12.127
Percent Hispanic                           32.354
Table 6 Z-scores of positional accuracy calculated from Moran’s I
Data set                       Z-score
OSM positional accuracy        0.23
OSMCP positional accuracy      0.32
accuracy with respect to the demographic indicators
presented in Table 1 above. The analysis of the
OSM and OSMCP datasets indicated that several of these
demographic indicators exhibited
multicollinearity issues (see section ‘‘Linear
regression’’). These included percent female, percent HS
diploma, percent white, and percent African American.
The exploratory regression results indicated that three
further demographic indicators (percent white not Hispanic,
percent black, and percent Hispanic) also exhibited
multicollinearity. These indicators were then removed
from further analysis and the exploratory regression
was run again. The results of the second exploratory
regression of positional accuracy data for both the
OSM and OSMCP datasets were consistent with the
results of the spatial autocorrelation reported above
and yielded no discernible model with respect to
demographic indicators that would explain the spatial
distribution of positional accuracy. Similarly,
exploratory regression for completeness, using
presence (1) and absence (0) as the determinant values,
was conducted for both the OSM and OSMCP datasets.
As with positional accuracy, the
results did not yield a discernible model of
completeness with respect to demographic indicators.
Geographically weighted regression
Following the results of the exploratory linear regres-
sion, GWR was applied in order to explore possible
relationships between demographic indicators and
positional accuracy and completeness at the local
scale. However, this analysis did not yield any
statistically significant results (at a 95 % confidence
level) or discernible regression models. These results
are consistent with the regression results presented
earlier.
Discussion
The analyses carried out in this study addressed the
commonly suggested notion that VGI quality is
generally associated with demographic indicators of the
Fig. 5 LISA analysis of positional accuracy for the (a) OSM and (b) OSMCP VGI data sets
area covered by the VGI (e.g. Haklay 2010; Girres and
Touya 2010; Zielstra and Zipf 2010; Cipeluch et al.
2010; Elwood 2009; Zook and Graham 2007a, b;
Crutcher and Zook 2009). Using 18 demographic
indicators from four separate categories (general
population; economic status; educational attainment;
and race/ethnicity), and focusing on point features
(schools), we examined whether such associations
could be detected, and if so whether they are
statistically significant. In order to accomplish this,
we used two VGI sources (OSM and OSMCP) and
compared them to a reference (ORNL) using the
feature matching methodology developed by Jackson
et al. (2013). This enabled us to evaluate two quality
measures of each VGI source, namely positional
accuracy and completeness, and compare and contrast
them with the various demographic indicators. Our
analysis included both non-spatial and spatial meth-
ods, which ranged from the comparison of the
distributions of the quality measure and simple
regression to exploratory and geographically weighted
regressions. These analyses, however,
failed to identify a clear and consistent association (or
statistically significant correlation) between either
positional accuracy or completeness and any of the
demographic properties. While some associations did
emerge (e.g. in the NN or LISA analysis), they were
found to be localized and sporadic. As a result, we
were not able to identify any patterns or combinations
of demographic indicators that are associated with
improved VGI quality.
These results suggest that, at least in some cases,
the underlying mechanisms that control VGI quality
are more involved, and that modeling VGI quality
through a direct relation with demographic indicators
may not be able to account for the intricate nature of
VGI quality. One potential explanation for the
emerging complex nature of VGI quality is that in
general, VGI contributions are typically not restricted
to contributors from the mapped area, thus enabling
virtually any user across the globe to contribute
information. Consequently, the VGI quality charac-
teristics resulting from such process are driven by a
potentially heterogeneous mixture of demographic
indicators, rendering the process of identifying spe-
cific demographic drivers of VGI quality difficult. To
some extent, this issue has been addressed in this study
through the use of the OSMCP data set for which the
contributors were indeed local to the area covered by
the dataset. However, clear associations with
demographic indicators were not found in this dataset
either.
Recent research related to our study by Li et al.
(2013), which focused on social media usage (Twitter
and Flickr), did find a relationship between Twitter
usage and the percentage of well-educated people with an
advanced degree and high income. In addition, high
Flickr activity was found to be correlated with a high
percentage of highly educated white and Asian people.
Similarly, Kent and Capello (2013), who studied the
use of social media during a crisis situation (a wildfire),
indicated that demographic characteristics of the area
impacted by the emergency situation could be used to
reveal the propensity of its population to contribute
information in social media during such a crisis. While
our findings appear to be in contrast to these findings,
it is important to note that their focus is on social
media rather than VGI. Arguably, social media offers
its users an environment that is substantially different
from that of VGI, which may lead to differences in the
motivation of users to contribute information as well
as differences in usage patterns. Consequently, the
question whether the findings of these recent studies
could be generalized to a VGI setting as well remains
open.
It is important to recognize that there are several
noteworthy limitations to our study. First, our study
focused on a single type of feature, namely schools.
While this was beneficial for the construction of our
analysis workflow, further work is required in order to
explore whether our findings are consistent across
different feature types (e.g. stores or hospitals). In
addition, our study did not explore other types of VGI
quality measures, such as attribute accuracy or logical
consistency. Lastly, additional analysis methods—
both non-spatial and spatial—and other demographic
indicators should be further explored in an attempt to
model the relationship between demographic indica-
tors and VGI quality. For instance, while our analysis
focused on a linear model, non-linear models should
also be explored.
Summary and outlook
The enabling of citizens without formal training to
produce geographical products for mass consumption
through the use of web-based tools and technologies is
introducing opportunities and challenges for our field.
Opportunities arise from the availability of additional
data that may extend the coverage of authoritative
datasets, or even better represent particular types of
events (e.g. capturing rapidly evolving events). The
challenges are associated with the integration of such
contributions with authoritative content. This integra-
tion is impeded by the lack of an understanding of the
accuracy of VGI datasets, and potential patterns behind
the variations of such accuracy. The study presented
within this paper has explored the relationship between
demographics and VGI quality, in order to assess the
often-repeated argument that demographics may relate
to corresponding accuracy variations.
We conducted a quantitative study to address this
issue, using data from a major metropolitan area, and
focused on point-represented areal features. The Denver
metro area that was selected as our study area offers a
unique advantage as we have available for it not only
authoritative (ORNL) and traditional crowdsourced
content (OSM) but also content derived through a
hybrid, partially supervised crowdsourcing process
(OSMCP). For this area we selected schools as a
representative feature type, for a variety of reasons.
First, schools tend to be more easily identifiable, as they
are large facilities, with well-defined components.
Secondly, in a demographically segregated city such
as Denver (James 1986; Aske et al. 2011) school
locations can be viewed as representative samples of
demographically diverse neighborhoods. The results of
our analysis do not support the argument that a
correlation exists between VGI error and local
demographic properties, as they show no such
statistically significant association.
VGI on a massive scale is a very complex process
and we need to gain a better understanding of the
mechanics that drive participation and content quality.
This study addressed one potential avenue, assessing
the role of demographics. While the findings of this
first quantitative study do not support earlier argu-
ments for a correlation between demographics and
accuracy, a natural future extension of this work would
be to extend the analysis to additional areas, and to
consider additional areal feature types (e.g. hospitals,
fire and police stations). Given the findings of this
research, one could argue that another logical next step
would be to study the process characteristics (rather
than its spatial and demographic indicators) and their
relation to data quality, because the process may be
more complex than can be described by work done so
far. Understanding who contributes VGI data, and why,
remains an open research question. As Steinmann
et al. (2013) noted, a small percentage of contributors
are responsible for the majority of mapping. Deter-
mining which ‘small percent’ of the local population is
actually contributing poses a significant challenge to
future research relating demographics of place with
VGI contribution quality. It is through such studies
that we will be able to gain an understanding of
patterns behind the volunteering of geographical
information, and the corresponding accuracy varia-
tions. This will improve our capabilities to integrate
such datasets with authoritative collections, thus
allowing us to harvest the full potential of VGI.
References
Al-Bakri, M., & Fairbairn, D. (2010). Assessing the accuracy of
‘Crowdsourced’ data and its integration with official spa-
tial data sets. In Proceedings of the 9th international
symposium on spatial accuracy assessment in natural
resources and environmental sciences, Leicester, UK,
pp. 317–320.
Andrews, D. T., Chen, L., Wentzell, P. D., & Hamilton, D. C.
(1996). Comments on the relationship between principal
components analysis and weighted linear regression for
bivariate data sets. Chemometrics and Intelligent Labora-
tory Systems, 34(2), 231–244.
Anselin, L. (1995). Local indicators of spatial association—
LISA. Geographical Analysis, 27(2), 93–115.
Aske, D., Corman, R. R., & Marston, C. (2011). Education
policy and school segregation: A study of the Denver
metropolitan region. Journal of Legal, Ethical & Regula-
tory Issues, 14(2), 27–35.
Braun, M. T., & Oswald, F. L. (2011). Exploratory regression
analysis: A tool for selecting models and determining
predictor importance. Behavior Research Methods, 43(2),
331–339.
Brown, G., & Pullar, D. (2012). An evaluation of the use of
points versus polygons in public participation geographic
information systems using quasi-experimental design and
Monte Carlo simulation. International Journal of Geo-
graphical Information Science, 26(2), 231–246.
Burt, J., Barber, G., & Rigby, R. (2009). Elementary statistics
for geographers (3rd ed.). New York, NY: Guilford Press.
Cipeluch, B., Jacob, R., Winstanley, A., & Mooney, P. (2010).
Comparison of the accuracy of OpenStreetMap for Ireland
with Google Maps and Bing Maps. In Proceedings of the
9th international symposium on spatial accuracy assess-
ment in natural resources and environmental sciences,
Leicester, UK, pp. 337–340.
Clark, P. J., & Evans, F. C. (1954). Distance to nearest neighbor
as a measure of spatial relationships in populations. Ecol-
ogy, 35(4), 445–453.
GeoJournal
123
Coleman, D. J., Georgiadou, Y., & Labonte, J. (2009). Volun-
teered geographic information: The nature and motivation
of produsers. International Journal of Spatial Data Infra-
structures Research, 4(1), 332–358.
Crutcher, M., & Zook, M. (2009). Placemarks and waterlines:
Racialized cyberscapes in Post-Katrina Google Earth.
Geoforum, 40(4), 523–534.
Davis, J. (1973). Statistics and data analysis in geology. New
York, NY: Wiley.
de Smith, M. J., Goodchild, M. F., & Longley, P. A. (2007).
Geospatial analysis: A comprehensive guide to principles,
techniques and software tools (2nd ed.). Winchelsea, UK:
The Winchelsea Press.
Elwood, S. (2008). Volunteered geographic information: Key
questions, concepts and methods to guide emerging
research and practice. GeoJournal, 72(3–4), 133–135.
Elwood, S. (2009). Geographic information science: Emerging
research on the societal implications of the geographical
web. Progress in Human Geography, 34(3), 349–357.
Elwood, S., Goodchild, M. F., & Sui, D. (2013). Prospects for
VGI research and the emerging fourth paradigm. In D. Sui,
S. Elwood, & M. F. Goodchild (Eds.), Crowdsourcing
geographic knowledge: Volunteered geographic informa-
tion (VGI) in theory and practice (pp. 361–375). New
York, NY: Springer.
Fairbairn, D., & Al-Bakri, M. (2013). Using geometric proper-
ties to evaluate possible integration of authoritative and
volunteered geographic information. ISPRS International
Journal of Geo-Information, 2(2), 349–370.
Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002).
Geographically weighted regression & associated tech-
niques. Chichester, UK: Wiley.
Ghose, R., & Elwood, S. (2003). Public participation GIS and
local political context: Propositions and research direc-
tions. URISA Journal, 15(2), 17–22.
Girres, J.-F., & Touya, G. (2010). Quality assessment of the
French OpenStreetMap dataset. Transactions in GIS,
14(4), 435–459.
Goodchild, M. F. (2007). Citizens as sensors: The world of
volunteered geography. GeoJournal, 69(4), 211–221.
Goodchild, M. F., & Li, L. (2012). Assuring the quality of
volunteered geographic information. Spatial Statistics,
1(1), 110–120.
Graham, S. D. (2005). Software-sorted geographies. Progress in
Human Geography, 29(5), 562–580.
Haklay, M. (2010). How good is volunteered geographical
information? A comparative study of OpenStreetMap and
Ordnance Survey datasets. Environment and Planning B,
37(4), 682–703.
Haklay, M. M., Basiouka, S., Antoniou, V., & Ather, A.
(2010). How many volunteers does it take to map an area
well? The validity of Linus’ Law to volunteered geo-
graphic information. The Cartographic Journal, 47(4),
315–322.
Haklay, M., & Weber, P. (2008). OpenStreetMap: User-generated
street maps. IEEE Pervasive Computing, 7(4), 12–18.
Hochmair, H. H., & Zielstra, D. (2013). Development and
completeness of points of interest in free and proprietary
data sets: A Florida case study. Creating the GISociety—
Conference proceedings (pp. 39–48). Salzburg, Austria.
Holloway, T., Bozicevic, M., & Borner, K. (2007). Analyzing
and visualizing the semantic coverage of Wikipedia and its
authors. Complexity, 12(3), 30–40.
Huberty, C. J. (1984). Issues in the use and interpretation of
discriminant analysis. Psychological Bulletin, 95(1),
156–171.
Hudson-Smith, A., Crooks, A. T., Gibin, M., Milton, R., &
Batty, M. (2009). Neogeography and Web 2.0: Concepts,
tools and applications. Journal of Location Based Services,
3(2), 118–145.
Jackson, S. P., Mullen, W., Agouris, P., Crooks, A. T., Croitoru,
A., & Stefanidis, A. (2013). Assessing completeness and
spatial error of features in volunteered geographic infor-
mation. ISPRS International Journal of Geo-Information,
2(2), 507–530.
James, F. J. (1986). A new generalized “Exposure-Based”
segregation index demonstration in Denver and Houston.
Sociological Methods & Research, 14(3), 301–316.
Kent, J. D., & Capello, H. T. (2013). Spatial patterns and
demographic indicators of effective social media content
during the Horsethief Canyon fire of 2012. Cartography
and Geographic Information Science, 40(2), 78–89.
Koukoletsos, T., Haklay, M., & Ellul, C. (2012). Assessing data
completeness of VGI through an automated matching
procedure for linear data. Transactions in GIS, 16(4),
477–498.
Kuznetzov, S. (2006). Motivations of contributors to Wikipedia.
ACM SIGCAS Computers and Society, 35(2), 1–7.
Li, L., Goodchild, M. F., & Xu, B. (2013). Spatial, temporal,
and socioeconomic patterns in the use of Twitter and
Flickr. Cartography and Geographic Information Science,
40(2), 61–77.
Longley, P. A., Goodchild, M. F., Maguire, D. J., & Rhind, D.
W. (2010). Geographical information systems and science
(3rd ed.). New York, NY: Wiley.
Longley, P. A., & Singleton, A. D. (2009). Linking social
deprivation and digital exclusion in England. Urban
Studies, 46(7), 1275–1298.
McKechnie, J. (1983). Webster’s new twentieth century dictio-
nary (2nd ed.). New York, NY: Simon and Schuster.
Mooney, P., & Corcoran, P. (2012). Characteristics of heavily
edited objects in OpenStreetMap. Future Internet, 4(1),
285–305.
Mooney, P., Corcoran, P., & Winstanley, A. (2010). Towards
quality metrics for OpenStreetMap. In Proceedings of the
18th SIGSPATIAL international conference on advances in
geographic information systems, San Jose, CA, pp. 514–517.
Moran, P. A. (1950). Notes on continuous stochastic phenom-
ena. Biometrika, 37(1–2), 17–23.
Neis, P., Zielstra, D., & Zipf, A. (2011). The street network
evolution of crowdsourced maps: OpenStreetMap in Ger-
many 2007–2011. Future Internet, 4(1), 1–21.
Neis, P., & Zipf, A. (2012). Analyzing the contributor activity of
a volunteered geographic information project—The case of
OpenStreetMap. ISPRS International Journal of Geo-
Information, 1(2), 146–165.
Nov, O., Arazy, O., & Anderson, D. (2011). Technology-med-
iated citizen science participation: A motivational model.
In Proceedings of the 5th international AAAI conference on
weblogs and social media, Barcelona, Spain.
OpenStreetMap. (2013a). Tag: Amenity = school.
http://wiki.openstreetmap.org/wiki/Tag:amenity%3Dschool.
Accessed on 17 May 2013.
OpenStreetMap. (2013b). USGS geographic names information
system. http://wiki.openstreetmap.org/wiki/GNIS.
Accessed on 17 May 2013.
Oreg, S., & Nov, O. (2008). Exploring motivations for con-
tributing to open source initiatives: The roles of contribu-
tion context and personal values. Computers in Human
Behavior, 24(5), 2055–2073.
Over, M., Schilling, A., Neubauer, S., & Zipf, A. (2010). Gen-
erating Web-based 3D city models from OpenStreetMap:
The current situation in Germany. Computers, Environ-
ment and Urban Systems, 34(6), 496–507.
Poore, B. S., Wolf, E. B., Korris, E. M., Walter, J. L., & Mat-
thews, G. D. (2012). Structures data collection for the
national map using volunteered geographic information.
U.S. Geological Survey open-file report 2012–1209, Res-
ton, VA. http://pubs.usgs.gov/of/2012/1209.
Porter, C. E., & Donthu, N. (2006). Using the technology
acceptance model to explain how attitudes determine
internet usage: The role of perceived access barriers and
demographics. Journal of Business Research, 59(9),
999–1007.
Press, S. J., & Wilson, S. (1978). Choosing between logistic
regression and discriminant analysis. Journal of the
American Statistical Association, 73(364), 699–705.
Schmidt, M., & Klettner, S. (2013). Gender and experience-
related motivators for contributing to OpenStreetMap.
Online proceedings of the international workshop on
action and interaction in volunteered geographic infor-
mation (ACTIVITY) at the 16th AGILE conference on
geographic information science, Leuven, Belgium.
Sieber, R. (2006). Public participation geographic information
systems: A literature review and framework. Annals of the
Association of American Geographers, 96(3), 491–507.
Steinmann, R., Grochenig, S., Rehrl, K., & Brunauer, R. (2013).
Contribution profiles of voluntary mappers in Open-
StreetMap. Online proceedings of the international
workshop on action and interaction in volunteered geo-
graphic information (ACTIVITY) at the 16th AGILE con-
ference on geographic information science, Leuven,
Belgium.
Sui, D. (2008). The wikification of GIS and its consequences: Or
Angelina Jolie’s new tattoo and the future of GIS. Computers,
Environment and Urban Systems, 32(1), 1–5.
The National Map. (2014).
http://nationalmap.gov/TheNationalMapCorps/index.html.
Accessed on 23 May 2014.
Tulloch, D. L. (2008). Is VGI participation? From vernal pools
to video games. GeoJournal, 72(3–4), 161–171.
U.S. Census Bureau. (2012). Geographic definitions.
http://www.census.gov/geo/www/geo_defn.html#CensusTract.
Accessed on 17 May 2013.
Vickery, G., & Wunsch-Vincent, S. (2007). Participative web
and user-created content: Web 2.0 wikis and social net-
working. Organization for Economic Cooperation and
Development (OECD), Paris, France.
Wheeler, D., & Tiefelsdorf, M. (2005). Multicollinearity and
correlation among local regression coefficients in geo-
graphically weighted regression. Journal of Geographical
Systems, 7(2), 161–187.
Wong, D. W. S., & Lee, J. (2005). Statistical analysis of geo-
graphic information with ArcView GIS and ArcGIS.
Hoboken, NJ: Wiley.
Zielstra, D., & Zipf, A. (2010). A comparative study of pro-
prietary geodata and volunteered geographic information
for Germany. In Proceedings of the 13th AGILE interna-
tional conference on geographic information science,
Guimaraes, Portugal, pp. 1–15.
Zook, M. A., & Graham, M. (2007a). The creative reconstruc-
tion of the internet: Google and the privatization of
cyberspace and DigiPlace. Geoforum, 38(6), 1322–1343.
Zook, M. A., & Graham, M. (2007b). Mapping DigiPlace:
Geocoded internet data and the representation of place.
Environment and Planning B, 34(3), 466–482.