Towards a Seamless World Names Database
Using Twitter data as demographic data
A. Leak∗, M. Adnan∗∗, P. Longley∗∗
∗Jill Dando Institute of Security and Crime Science, UCL
∗∗Department of Geography, UCL
Association of American Geographers 2014
Outline
1. The World Names Database
2. Extraction of names and locations from social media data
3. Utility of such methods
4. Application
5. Summary and future work
The World Names Database
1. Representative of 2 Billion of the Earth’s population
2. 26 countries over 4 continents
3. Sourced from electoral roll and telephone directory data
Figure: Current coverage of World Names Database.
The World Names Database
Current limitations:
1. Not a complete representation of world population
   - Data not available for many countries
   - Existing data increasingly uncertain over time
Proposed solution:
1. Extraction of equivalent data from online social media
Data and tools
Data
1. Twitter
   - Seamless database of 1 billion geo-located tweets
   - Collected as part of the Uncertainty of Identity project [1] between Sep 2012 and Oct 2013
Figure: Current coverage of Twitter data.
[1] See http://www.uncertaintyofidentity.com
Data and tools
Data (continued)
2. UK Enhanced Electoral Roll from CACI Ltd
   - 46 million unique users
3. Spanish telephone directory
   - 11 million unique records
Tools
1. PostgreSQL (PostgreSQL GDG, 2013)
2. R (R Core Team, 2013)
OSM challenges and opportunities
Opportunities
1. Twitter has 212 million active users submitting more than 100 million messages per day
2. Rich data in both space and time
Challenges
1. Users have multiple locations
2. Users may have multiple names
3. Not all users are people, e.g. @w_pat_tc posts patents with location data
Identification of residential area
1. All users who tweet within the target geography identified
2. All tweets by identified users isolated
3. All tweets spatially joined to suitable geography
4. Users with >= 5 tweets and > 50% of their total tweets within the administrative border retained
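The retention rule in step 4 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' PostgreSQL/R workflow: it assumes the spatial join has already been done, so each tweet arrives as a hypothetical `(user_id, zone_id)` pair, and it reads ">= 5" as counting tweets inside the candidate zone. The function name `assign_residence` is invented for this example.

```python
from collections import Counter, defaultdict

MIN_TWEETS = 5   # ">= 5" threshold (assumed here to count tweets inside the zone)
MIN_SHARE = 0.5  # "> 50%" of the user's total tweets


def assign_residence(tweets, min_tweets=MIN_TWEETS, min_share=MIN_SHARE):
    """Assign each user a zone of residence, or drop the user.

    tweets: iterable of (user_id, zone_id) pairs, one per geo-located tweet,
    where zone_id is the administrative zone the tweet was joined to
    (None if the tweet fell outside the target geography).
    Returns a dict {user_id: zone_id} for retained users only.
    """
    per_user = defaultdict(Counter)
    for user_id, zone_id in tweets:
        per_user[user_id][zone_id] += 1

    residence = {}
    for user_id, zone_counts in per_user.items():
        total = sum(zone_counts.values())
        # Most frequent zone inside the target geography, if any.
        zone, n = max(
            ((z, c) for z, c in zone_counts.items() if z is not None),
            key=lambda zc: zc[1],
            default=(None, 0),
        )
        if zone is not None and n >= min_tweets and n / total > min_share:
            residence[user_id] = zone
    return residence
```

Because the share must be strictly greater than 50%, at most one zone can qualify per user, so ties never need to be broken.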
Extraction of names
1. Twitter screen names segmented and cleaned
2. Segments tested against known prefixes and suffixes
3. Segments ordered to reflect likely forename - surname structure
Example:
Input: //** Mr Alistair Leak ////////2014""
Segmented and cleaned: MR, ALISTAIR, LEAK
Prefixes and suffixes removed: ALISTAIR, LEAK
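The three steps can be approximated as follows. This is a rough sketch only: the prefix and suffix lists are illustrative placeholders (the lists actually used are not given in the slides), and taking the first and last remaining segments as forename and surname is an assumption about the ordering step. The function name `extract_name` is invented for this example.

```python
import re

# Illustrative lists only; the real prefix/suffix lists are not in the slides.
PREFIXES = {"MR", "MRS", "MS", "MISS", "DR", "PROF"}
SUFFIXES = {"JR", "SR", "PHD", "MD"}


def extract_name(screen_name):
    """Return (forename, surname), or None if fewer than two segments remain."""
    # Keep runs of letters (including accented ones); digits, underscores
    # and punctuation act as separators and are discarded.
    segments = [s.upper() for s in re.findall(r"[^\W\d_]+", screen_name)]
    # Strip known prefixes from the front and known suffixes from the end.
    while segments and segments[0] in PREFIXES:
        segments.pop(0)
    while segments and segments[-1] in SUFFIXES:
        segments.pop()
    if len(segments) < 2:
        return None
    # Assumption: first remaining segment is the forename, last is the surname.
    return segments[0], segments[-1]
```

Applied to the example string from the slide, `extract_name('//** Mr Alistair Leak ////////2014""')` yields the pair `ALISTAIR`, `LEAK`. Names that follow other conventions, such as Spanish double surnames, would need a different ordering rule.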
Benchmarking: Method
1. Twitter vs data of known provenance - UK electoral roll or Spanish telephone directory
2. Both sets of users assigned to consistent national geography
3. Morisita-Horn similarity (MHS) calculated between two user groups for each administrative zone
   - MHS applied in R using the vegan package [2] (Oksanen et al., 2013)
   - Score of 0 indicates identical composition of names and 1 indicates no overlap
   - MHS method is recognised for its ability to handle differing sample sizes (Wolda, 1981)
[2] The vegan implementation is a dissimilarity score, i.e. 1 - MHS
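For readers without R to hand, the dissimilarity form of the index can be written out directly. The sketch below uses the standard Morisita-Horn formula, 1 - 2Σxᵢyᵢ / ((λx + λy)·X·Y) with λx = Σxᵢ²/X², which is understood to match what vegan's `vegdist` computes with `method = "horn"`; it is an illustration, not the code used in the study. The function name and the dict-of-counts input format are assumptions made for this example.

```python
def morisita_horn_dissimilarity(x, y):
    """Morisita-Horn dissimilarity (1 - similarity) between two groups.

    x, y: dicts mapping a surname to its count in each user group.
    Returns 0 when both groups have identical name proportions
    and 1 when they share no names at all.
    """
    total_x = sum(x.values())
    total_y = sum(y.values())
    if total_x == 0 or total_y == 0:
        raise ValueError("both groups need at least one name")
    # Cross-product over names present in either group.
    names = set(x) | set(y)
    cross = sum(x.get(n, 0) * y.get(n, 0) for n in names)
    # Simpson-type concentration terms for each group.
    lambda_x = sum(c * c for c in x.values()) / total_x ** 2
    lambda_y = sum(c * c for c in y.values()) / total_y ** 2
    similarity = 2 * cross / ((lambda_x + lambda_y) * total_x * total_y)
    return 1 - similarity
```

Because the index works on proportions, a Twitter sample of a few hundred users can be compared with a register of millions without rescaling either side first.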
Benchmarking: UK
1. Enhanced Electoral Roll and Twitter users assigned to GADM level 2 administrative geography (Hijmans et al., 2013)
2. Morisita-Horn index of overlap calculated at GADM levels 0, 1 and 2
         Morisita-Horn   Hedrick
Value    0.0902          0.9098
Table: Summary statistics for UK GADM level 0
         Morisita-Horn   Hedrick
Min.     0.0624          0.8815
Mean     0.1022          0.8978
Max.     0.1185          0.9376
Table: Summary statistics for UK GADM level 1
         Morisita-Horn   Hedrick
Min.     0.06816         0.1145
Mean     0.33883         0.6612
Max.     0.88547         0.9318
Table: Summary statistics for UK GADM level 2
Benchmarking: UK 3
Classification Accuracy
   - Stratified sample of 1073 users
   - Users' declared locations manually geocoded
Metric        Value
Sample size   1073
Ambiguous     133
Aspatial      416
Usable        524
Accuracy      0.8386
Kappa         0.8328
Table: Confusion Matrix Results
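Accuracy and Cohen's kappa are both derived from a confusion matrix of declared versus inferred locations. The sketch below shows the standard calculation; the matrix layout and the function name are assumptions for illustration, and the slides do not reproduce the underlying matrix.

```python
def accuracy_and_kappa(matrix):
    """Overall accuracy and Cohen's kappa from a square confusion matrix.

    matrix[i][j]: number of users whose manually geocoded (declared)
    location is class i and whose inferred location is class j.
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of users on the diagonal.
    observed = sum(matrix[i][i] for i in range(k)) / total
    # Agreement expected by chance, from the row and column marginals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa
```

Kappa discounts the agreement that would arise by chance, which matters when a few large zones hold most of the users.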
Benchmarking: Spain
1. Spanish telephone directory and Twitter users assigned to GADM level 3 administrative geography (Hijmans et al., 2013)
2. Morisita-Horn index of overlap calculated at GADM levels 0, 1, 2 and 3
         Morisita-Horn   Hedrick
Value    0.2044713       0.7955287
Table: GADM level 0
         Morisita-Horn   Hedrick
Min.     0.1655          0.5997
Mean     0.2679          0.7321
Max.     0.4003          0.8345
Table: GADM level 1
         Morisita-Horn   Hedrick
Min.     0.1616          0.4545
Mean     0.2947          0.7053
Max.     0.5455          0.8384
Table: GADM level 2
         Morisita-Horn   Hedrick
Min.     0.1711          0.0000
Mean     0.4867          0.5133
Max.     1.0000          0.8289
Table: GADM level 3
Benchmarking: Spain 2
Figure: Morisita-Horn index for Spain GADM levels 1 (top left), 2 (top right) and 3 (bottom)
Mexico
1. Population of 120 million
2. Believed Twitter adoption of 11% (source: peerreach.com)
Method:
1. Users' regions identified at GADM level 1 (n zones)
2. Names processed using the same method as for Spain
3. For the WND, counts of surname by area standardised to reflect true population distribution
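The slides do not say how the standardisation in step 3 is carried out. One plausible reading, sketched below, is a simple proportional weighting: each region's surname counts are scaled so that they sum to that region's resident population. The function name, input layout and the weighting itself are assumptions for illustration.

```python
def standardise_counts(surname_counts, population):
    """Scale per-region surname counts to sum to the regional population.

    surname_counts: {region: {surname: number of Twitter users}}
    population:     {region: resident population of the region}
    Returns {region: {surname: estimated number of residents}}.
    """
    scaled = {}
    for region, counts in surname_counts.items():
        sample_size = sum(counts.values())
        if sample_size == 0:
            continue  # no users allocated to this region, nothing to scale
        # Assumed weighting: residents represented by each Twitter user.
        weight = population[region] / sample_size
        scaled[region] = {name: c * weight for name, c in counts.items()}
    return scaled
```

This corrects for uneven Twitter adoption between regions, but not for surnames being over- or under-represented among Twitter users within a region.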
Results:
1. 165,000 users allocated a region of residence
Conclusions and future work
Future work:
1. Further development of name extraction methodology
2. Dynamic spatial resolution based on individual users' data
3. Increased automation of the location extraction framework
4. Further benchmarking of data against existing sources
Conclusions:
1. Not all countries are suitable for the Twitter-based method
2. Balance between precision, accuracy and utility in terms of spatial resolution
References
Robert Hijmans et al. GADM. 2013.
Jari Oksanen et al. vegan: Community Ecology Package. R package version 2.0-10. 2013.
PostgreSQL GDG. PostgreSQL. http://www.postgresql.org. 2013.
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2013.
Henk Wolda. “Similarity indices, sample size and diversity”. In: Oecologia 50.3 (1981), pp. 296–302.