ESSnet Big Data II
Grant Agreement Number: 847375-2018-NL-BIGDATA
https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata
https://ec.europa.eu/eurostat/cros/content/essnetbigdata_en
Workpackage J
Innovative Tourism Statistics
Deliverable J5:
Final Report containing final results and a full description of the methodology used
Final version, 17.11.2020
ESSnet co-ordinator:
Peter Struijs (CBS) p.struijs@cbs.nl
Workpackage Leader:
Marek Cierpiał-Wolan (Statistics Poland, Poland) M.Cierpial-Wolan@stat.gov.pl
telephone : + 48 17 85 35 210 ext. 311 mobile phone : + 48 515 037 640
Prepared by:
WPJ team
Workpackage J team:
Marek Cierpiał-Wolan (Statistics Poland) Rui Alves (INE) Vassiliki Benaki (ELSTAT) Eleni Bisioti (ELSTAT) Mascia Di Torrice (ISTAT) Maria Fernandes (INE) Boris Frankovič (SOSR) Tobias Gramlich (HESSE) Kostadin Georgiev (BNSI) Nico Heerschap (CBS) Maria Laftsidou (ELSTAT) Filipa Lidónio (INE) Asimina Katri (ELSTAT) Shirley Ortega-Azurduy (CBS) Martina Özoğlu (SOSR) Eleni Papadopoulou (ELSTAT) Christina Pierrakou (ELSTAT) Danny Pronk (CBS) Galya Stateva (BNSI) Marcela Zavadilová (SOSR)
Table of contents
Executive summary ................................................................................................................................. 5
Glossary ................................................................................................................................................... 7
Introduction ............................................................................................................................................ 8
1. Big data sources .......................................................................................................................... 9
1.1. Inventory of big data sources ................................................................................................. 9
1.2. Web scraping ........................................................................................................................ 15
1.3. Source characteristics ........................................................................................................... 18
2. Methods used for new tourism data ........................................................................................ 35
2.1. Methodology for combining and disaggregating data ................................................................ 35
2.1.1. Combining data ...................................................................................................................... 35
2.1.2. Spatial-temporal disaggregation of tourism data .................................................................. 40
2.2. Flash estimates ............................................................................................................................ 41
2.3. Methodology to improve the quality of data in various statistical areas ................................... 44
2.3.1. Estimations of the size of tourist traffic ................................................................................. 47
2.3.2. Estimation of expenses related to trips ................................................................................. 48
2.3.3. Tourism Satellite Accounts .................................................................................................... 49
3. Implementation of Tourism Integration and Monitoring System prototype ........................... 56
4. Case studies .............................................................................................................................. 63
4.1. Web scraping ........................................................................................................................ 63
WPJ.1.BG .......................................................................................................................................... 63
WPJ.1.DE-Hesse ............................................................................................................................... 65
WPJ.1.IT ..........................................................................................................................................70
WPJ.1.NL .......................................................................................................................................... 73
WPJ.1.PT .......................................................................................................................................... 79
WPJ.1.SK .......................................................................................................................................... 90
4.2. Source characteristics ........................................................................................................... 96
WPJ.2.DE-Hesse ............................................................................................................................... 96
WPJ.2.IT ........................................................................................................................................99
WPJ.2.NL ........................................................................................................................................ 105
WPJ.2.PT ........................................................................................................................................ 107
4.3. Legal aspects ....................................................................................................................... 110
WPJ.3.DE-Hesse ............................................................................................................................. 110
WPJ.3.NL ........................................................................................................................................ 112
4.4. Combining data ................................................................................................................... 114
WPJ.4.EL ........................................................................................................................................114
WPJ.4.DE-Hesse ............................................................................................................................. 118
WPJ.4.IT ........................................................................................................................................119
WPJ.4.NL ........................................................................................................................................ 124
WPJ.4.PL .......................................................................................................................................127
WPJ.4a.PT ...................................................................................................................................... 129
WPJ.4b.PT ...................................................................................................................................... 131
WPJ.4.SK ........................................................................................................................................ 136
4.5. Spatial-temporal disaggregation of data ............................................................................ 143
WPJ.5.NL ........................................................................................................................................ 143
WPJ.5.PL ......................................................................................................................... 149
WPJ.5.SK ........................................................................................................................................ 154
4.6. Flash estimates of the occupancy of accommodation establishments .............................. 166
WPJ.6.NL ........................................................................................................................................ 166
WPJ.6.PL. ........................................................................................................................................ 172
WPJ.6.SK ........................................................................................................................................ 175
4.7. Methodology to improve the quality of data in various statistical areas ........................... 199
WPJ.7.PL .......................................................................................................................................199
4.8. Experimental tourism statistics........................................................................................... 208
WPJ.8.BG ........................................................................................................................................ 208
WPJ.8.NL ........................................................................................................................................ 229
5. Conclusions ............................................................................................................................. 233
References .......................................................................................................................................... 237
Annexes ............................................................................................................................................... 243
Annex 1 – Quality indicators of data matching ................................................................................ 243
Annex 2 – General approach for data disaggregation...................................................................... 245
Annex 3 – Combining biased and unbiased data ............................................................................. 248
Annex 4 – Quality template for combining data from various sources ........................................... 249
Annex 5 – R script to visNetwork object WPJ.2.PT........................................................................... 262
Annex 6 – R script to visNetwork object (detail) .............................................................................. 269
Annex 7 – CBS taxonomy .................................................................................................................. 270
Annex 8 – Process for data linkage .................................................................................................. 271
Executive summary
This deliverable, J5 Final Report containing final results and a full description of the methodology used,
summarizes the work carried out under Workpackage J - Innovative Tourism Statistics (WPJ) of
the ESSnet Big Data II project by eight European statistical institutes:
Statistics Poland (GUS) represented by the Statistical Office in Rzeszów (leader of the WPJ),
National Statistical Institute of the Republic of Bulgaria (BNSI),
Hellenic Statistical Authority (ELSTAT),
Hesse Statistical Office (HESSE, Germany),
Italian National Institute of Statistics (ISTAT),
Statistics Netherlands (CBS),
Statistics Portugal (INE),
and Statistical Office of the Slovak Republic (SOSR).
Detailed information related to each of the issues under consideration by the project participants can
be found in the four partial reports available on the WIKI platform of the ESSnet Big Data II project, in
the part dedicated to the WPJ package1.
The work undertaken in the project was an attempt to meet the following challenges:
• preparing an inventory of data sources related to tourism statistics (including big data sources)
in individual partner countries along with their description and classification,
• developing a scalable solution for downloading data using web scraping techniques from
selected web portals offering the possibility of booking accommodation,
• implementing methods of combining and matching data on tourist accommodation
establishments in order to integrate statistical databases with data from web scraping for the
purpose of improving the completeness of the survey population of tourist accommodation
establishments,
• spatial-temporal disaggregation of the use of tourist accommodation establishments,
• preparing flash estimates of occupancy of tourist accommodation establishments to shorten
the time of data publication for external recipients,
• developing a methodology for estimating the volume of tourist traffic and tourist expenses
with the use of various data sources (statistical and non-statistical).
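The scalable scraping solutions themselves are documented in chapter 4. Purely as an illustration of the extraction step involved, the Python sketch below parses a hypothetical fragment of a booking portal's results page using only the standard library; the HTML structure, class names and values are invented for this example and do not correspond to any real portal or to the project's production scrapers.

```python
from html.parser import HTMLParser

# Hypothetical snippet of a booking-portal results page (illustrative only).
SAMPLE_HTML = """
<div class="listing"><span class="name">Hotel Alfa</span><span class="price">120</span></div>
<div class="listing"><span class="name">Pension Beta</span><span class="price">55</span></div>
"""

class ListingParser(HTMLParser):
    """Collects (name, price) pairs from <span class="name"> / <span class="price">."""
    def __init__(self):
        super().__init__()
        self.field = None     # which field the next text chunk belongs to
        self.records = []     # finished (name, price) tuples
        self._name = None
    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self.field = cls
    def handle_data(self, data):
        if self.field == "name":
            self._name = data.strip()
        elif self.field == "price":
            self.records.append((self._name, float(data)))
        self.field = None

parser = ListingParser()
parser.feed(SAMPLE_HTML)
print(parser.records)  # [('Hotel Alfa', 120.0), ('Pension Beta', 55.0)]
```

In practice, the partners' scrapers additionally handle pagination, request throttling and the differing page structures of each portal.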
The activities listed here allowed for a preliminary assessment of the impact of the obtained results
on the improvement of the data presented in the Tourism Satellite Account (TSA). In addition, the key
issue addressed by WPJ was the preparation of the Tourism Integration and Monitoring System (TIMS)
prototype along with the micro-services dedicated to the above-mentioned areas that would support
statistical production in the field of tourism statistics and assist in monitoring changes in the tourism
sector.
1https://webgate.ec.europa.eu/fpfis/mwikis/EssNetbigdata/index.php/WPJ_Milestones_and_deliverables [accessed: 23.09.2020]
This final report consists of five chapters.
The first chapter describes the work carried out in the initial stage of the project under Task 1:
Inventory of big data sources related to tourism statistics. The reader will find here a general list of
data sources (including big data) identified by the individual WPJ partner countries during the
inventory process. The catalogue is divided into several subdirectories, e.g. by type of source and
frequency, by availability, and by usefulness in estimating the demand and supply sides of tourism.
In the subchapter dedicated to web scraping, attention is paid to the issue of identifying tourist portals
that will provide the most complete information on accommodation, transport or food. In the
subchapter entitled Source characteristics, the concept of mapping the identified sources and variables to
official statistical variables and domains using relationship maps is presented. In this part, the reader
will get acquainted with the visNetwork tool, which provides an interactive graphical representation
of the relationships and interconnections between multi-purpose data sources, survey data and web
scraped data, variables and domains, as well as countries and experimental results.
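The relationship maps themselves are rendered interactively with the visNetwork R package (see Annexes 5 and 6). Purely as an illustration of the underlying idea, the Python sketch below represents a source-variable-domain mapping as a simple edge list; the example nodes are invented and do not reproduce any country's actual map.

```python
# Each edge links a data source to a statistical variable it covers,
# or a variable to an official domain (invented example entries).
edges = [
    ("Booking portal (web scraped)", "number of bed places"),
    ("Accommodation survey frame", "number of bed places"),
    ("Accommodation survey frame", "occupancy"),
    ("number of bed places", "Tourism supply"),
    ("occupancy", "Tourism supply"),
]

def neighbours(node):
    """All nodes directly linked to `node`, in either direction."""
    return sorted({b for a, b in edges if a == node} | {a for a, b in edges if b == node})

print(neighbours("number of bed places"))
# ['Accommodation survey frame', 'Booking portal (web scraped)', 'Tourism supply']
```

Traversing such a structure answers questions like "which sources feed this variable?", which is exactly what the interactive visNetwork graphs let the reader explore visually.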
The second chapter of the report describes the methods developed and tested during the project by
partner countries under Task 3: Developing a methodology for combining and disaggregating data
from various sources. In particular, it contains details on the methods of combining data collected by
web scraping techniques from web portals offering accommodation booking services with statistical
databases containing the population frame for the survey on the use of tourist accommodation
establishments, as well as a description of methods for spatial and temporal disaggregation of data in
the field of tourism. In this chapter, the reader can also learn about the results of the methodological
work undertaken in the field of flash estimates (Task 4: Flash estimates in the field of tourism).
Subchapter 2.3 presents the approach and methods used for estimating the size of tourist traffic and
the expenses related to trips. In the last subchapter, the reader can trace the results
related to the Tourism Satellite Accounts (TSA), produced as part of the work within Task 5: Use of big data sources and
developed methodology to improve the quality of data in various statistical areas.
The third chapter presents an example of the implementation of the main components of the Tourism
Integration and Monitoring System (TIMS) prototype with dedicated micro-services. This system will
support statistical production in the area of tourism statistics and assist in monitoring changes in the
tourism sector. This part of the document describes the user interface (UI) and user experience (UX)
of the system prototype developed as part of the WPJ work. Additionally, the reader can view
an exemplary graphical interface design here.
The fourth chapter consists of case studies with results of individual tasks performed in each partner
country. The case studies are grouped according to the topics they relate to (web scraping, source
characteristics, legal aspects, combining data, spatial-temporal disaggregation, flash estimates,
methodology to improve the quality of data in various statistical areas and experimental tourism
statistics). Each use case has a standardized form, which makes it possible to easily identify the country
to which the case relates and to clearly trace the preconditions required for its preparation, the steps
taken, and the expected and final results obtained.
The report ends with the Conclusions and a list of annexes. One of them, Annex 4 is dedicated to the
Quality template completed for data gathered from various sources (including big data sources). The
basis of this form is the set of statistical concepts of the ESS Single Integrated Metadata Structure, which have
been extended and adapted to big data sources by members of the workpackage K - Methodology and
Quality.
Glossary
API (Application Programming Interface) is a strictly defined set of rules and their descriptions by
which computer programs communicate with each other.
BAT (batch file) is a script file in DOS, OS/2 and Microsoft Windows. It consists of a series of commands
to be executed by the command-line interpreter, stored in a plain text file.
GSBPM (Generic Statistical Business Process Model) describes and defines the set of business
processes needed to produce official statistics.
GSIM (Generic Statistical Information Model) is a reference framework of internationally agreed
definitions, attributes and relationships that describe the pieces of information that are used in the
production of official statistics (information objects). This framework enables generic descriptions of
the definition, management and use of data and metadata throughout the statistical production
process.
HTML (HyperText Markup Language) is a markup language that describes the structure of information
contained within a web page, giving meaning to individual fragments of text (hyperlinks, paragraphs,
headers and lists) and embedding file objects in the text.
JSON (JavaScript Object Notation) is an open standard file format, and data interchange format, that
uses human-readable text to store and transmit data objects consisting of attribute–value pairs and
array data types (or any other serializable value).
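For instance, the minimal Python snippet below shows a JSON object with attribute-value pairs and an array, parsed and re-serialized with the standard library; the field names and values are illustrative only.

```python
import json

# An illustrative attribute-value object (field names and values invented).
text = '{"establishment": "Hotel Alfa", "rooms": 24, "amenities": ["wifi", "parking"]}'
obj = json.loads(text)             # parse the JSON text into Python structures
print(obj["rooms"])                # 24
print(json.dumps(obj, indent=2))   # serialize back to human-readable JSON text
```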
Metadata is a description of other data.
PNG (Portable Network Graphics) is a file format commonly used to store web graphics, digital
photographs, and images with transparent backgrounds.
RSS (Rich Site Summary) is a web feed that allows users and applications to access updates to websites
in a standardized, computer-readable format.
XLSX (Office Open XML Spreadsheet) is a zipped, XML-based file format developed by Microsoft for
representing spreadsheets; it is part of the Office Open XML family, which also covers charts, presentations and word-processing documents.
XML (Extensible Markup Language) is a universal markup language designed to represent different
data in a structured way. It is platform-independent, which facilitates the exchange of documents
between different systems.
Web scraping (also known as Screen Scraping, Web Data Extraction, Web Harvesting, etc.) is
a technique used to extract large amounts of data from websites.
Web service is a service offered by an electronic device to another electronic device, the two devices
communicating via the World Wide Web.
Introduction
In the last few years, tourism trends have changed, both on the demand and supply sides. The rapidly
changing and available new technologies require national statistical offices (NSIs) to continuously
adapt their IT systems to collect data, including data from unstructured big data sources, as well as to
verify and process them. Moreover, the situation forces the NSIs to improve the methods and
techniques of acquiring new types of data and to implement innovative tools to advance the
consistency and comparability of the generated results, also in the area of tourism statistics.
Additionally, the current crisis caused by the COVID-19 pandemic, affecting all types of economic
activity, has had a particularly severe impact on the tourism industry, consequently causing a lot of
intermediate negative effects across the economy in many countries. The weaker condition of tourism
means worse results in trade, transport, warehousing, agriculture or the real estate market, sports,
culture and communication. As the pandemic unfolds in Europe, a significant part of destinations is
completely or partially inaccessible due to border closure and restrictions resulting from the
suspension of international air connections. Due to the restrictions introduced, the tourist industry,
as well as the behaviour of tourists themselves, has changed, which has affected the level
of tourist traffic in Europe.
This crisis has shown how important it is for official statistics to have monitoring tools and systems
that can quickly detect and visualize changes in the tourism industry in the EU. There is therefore a
need for faster, more disaggregated and up-to-date information that responds to the needs of
national governments, city authorities and tourism industry stakeholders, as well as tourism
entrepreneurs. In addition, such a system would enable official statistics to monitor the changes and
effects of the reopening of national economies to local and foreign tourists. It could be used to
measure how quickly the tourism industry begins to recover once countries begin easing travel
restrictions.
The participants of the pilot WPJ package were faced, among other things, with such a challenge - the
development of the Tourism Integration and Monitoring System (TIMS) prototype along with its
individual components. Another important element related to the use of new information sources
(including big data sources) in tourism statistics was the development of innovative approaches to
combining these data with data held by official statistics. This was not an easy task, given the
differences in structure, frequency, and methods of collecting and processing the data.
All results achieved during the two-year work within the WPJ and the solutions developed in the
above-mentioned areas are presented in this report.
1. Big data sources
Today’s digital world offers many big data sources which may help make tourism statistics more
up to date. However, official statistics remain indispensable due to the nature of tourism,
because not every movement constitutes tourism activity. Tourism is a very specific industry, based
on travellers’ behaviour and its limitations. According to European regulation 692/2011 on tourism
statistics: “tourism means the activity of visitors taking a trip to a main destination outside their usual
environment, for less than a year, for any main purpose, including business, leisure or other personal
purpose, other than to be employed by a resident entity in the place visited”. The tourism industry
includes accommodation, food and beverage, rental, transport, cultural and sport services, as well as
travel agency and tour operator services. Tourism is an industry characterized by the cross-sectional nature of
economic activities and industries within the economy. It is therefore not reflected in only one
or two industries but across the economy as a whole. From big data sources alone it is very difficult to determine whether a movement
fulfils all the criteria of tourism; however, connecting official statistical sources with big
data may make official statistics more accurate, owing to its correlation with tourism statistics, and may
provide additional information.
1.1. Inventory of big data sources
To create the catalogue and determine the usefulness of data sources, it was necessary to inventory
them. During the work, the sources that are already available or can be used in the future to improve
the quality of official statistical data in the field of tourism were selected. The inventory included
external (including administrative), statistical and Internet sources. As a result of the work, detailed
catalogues containing characteristics of the identified sources were created. The Flow Models of these
sources were presented in the J2 deliverable. The main catalogue, containing information on the
identified sources by individual partner countries, was divided into three subdirectories:
1. catalogue by types of sources and frequency,
2. catalogue by thematic areas,
3. catalogue by usefulness – demand vs supply side.
As part of ongoing work, project partners have inventoried a total of 130 sources of information. Just
over half of them (57.7%) were external sources, while the remaining 42.3% were internal sources.
Considering the number of identified sources in individual countries, it was found that the Netherlands
(44 sources), Slovakia (20), Poland (16) and Italy (14) showed the highest number of sources.
The Netherlands showed the largest number of internal and external data sources (22 sources each).
Among the remaining countries, Italy (9) and Slovakia (8) had the highest number of internal sources,
while Poland (13) and Slovakia (12) collected the highest number of external sources. A detailed
summary of the number of identified sources broken down by type is presented in Table 1.
Table 1. Number of data sources identified broken down by type
Type of source
Country participating in the grant
BG Hesse EL IT NL PL PT SK
Number of identified sources
Total 6 12 11 14 44 16 7 20
Internal sources 4 6 2 9 22 3 1 8
External sources 2 6 9 5 22 13 6 12
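The shares quoted above (130 sources in total, 57.7% external, 42.3% internal) can be reproduced directly from the per-country counts in Table 1, for example:

```python
# Internal and external source counts per country from Table 1
# (order: BG, Hesse, EL, IT, NL, PL, PT, SK).
internal = [4, 6, 2, 9, 22, 3, 1, 8]
external = [2, 6, 9, 5, 22, 13, 6, 12]

total = sum(internal) + sum(external)
print(total)                                   # 130 inventoried sources
print(round(100 * sum(external) / total, 1))   # 57.7 -> share of external sources
print(round(100 * sum(internal) / total, 1))   # 42.3 -> share of internal sources
```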
During the inventory, the identified sources were divided according to their usefulness in estimating
the demand and supply side of tourism. Considering that some data sources can be used to estimate
both sides of tourism at the same time, this division has also been included in the compilation
(see Table 3).
As a result of the conducted work, it was found that, among all the inventoried sources, those whose
data can be used for estimating the demand side of tourism largely prevailed (82 sources); 26 sources
may be helpful for estimating both the demand and supply sides, and 22 for the supply side alone.
The largest number of sources that could improve statistics on the demand side
of tourism was identified by the Netherlands (25 sources), followed by Slovakia (13), Poland (10) and Italy (9).
In the case of sources useful for estimating the supply side of tourism, the Netherlands identified the
largest number (8). The Netherlands also showed the highest number of sources that can
be used to estimate both the demand and supply sides of tourism – over 40% of all sources of this type
identified by the partner countries.
Table 2. Availability of external data sources by country of origin of the source
Availability of
external sources
Country participating in the grant
BG Hesse EL IT NL PL PT SK
Number of identified sources
Total 2 6 9 5 22 13 6 12
Available sources 2 2 2 2 13 11 5 2
Sources not available (temporarily or permanently) - 4 7 3 9 2 1 10
External data sources include those that partner countries do not yet have access to, as well as those
with limited availability. Analysing the availability of external data sources in all partner countries,
it was found that 52% of them were accessible. The highest share of available external sources was
observed in Poland (almost 85%). Portugal showed a similar rate of availability (83%).
It is worth noting that, among the 11 online sources identified by partner countries, 10 were reported by
the Netherlands and 1 by Poland. Table 2 presents the availability of the identified external data sources,
broken down into available and unavailable ones.
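The availability rates quoted above follow directly from the per-country counts of external and available external sources, for example:

```python
# External sources and those actually available, per country
# (order: BG, Hesse, EL, IT, NL, PL, PT, SK), from the tables in this section.
external  = [2, 6, 9, 5, 22, 13, 6, 12]
available = [2, 2, 2, 2, 13, 11, 5, 2]

print(round(100 * sum(available) / sum(external)))  # 52  -> "52% of them were accessible"
print(round(100 * available[5] / external[5], 1))   # 84.6 -> Poland, "almost 85%"
print(round(100 * available[6] / external[6], 1))   # 83.3 -> Portugal, "83%"
```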
Table 3. Breakdown of collected data sources according to their usefulness for estimating the demand or supply side of tourism
Side of tourism
Country participating in the grant
BG Hesse EL IT NL PL PT SK
Number of identified sources
Total 6 12 11 14 44 16 7 20
Supply side 1 1 1 3 8 2 2 4
Demand side 5 9 6 9 25 10 5 13
Both sides - 2 4 2 11 4 - 3
Among the available external data sources, those useful for estimating the demand side of tourism
prevail (Table 4). In Bulgaria, Germany and Greece, all available external data sources are useful for
estimating the demand side of tourism. The Netherlands reported only 31% of its available external data
sources as useful for the demand side of tourism, but almost 54% of the data sources identified by this country
can be useful for both sides of tourism.
Table 4. Breakdown of available external data sources according to their usefulness for estimating the demand or supply side of tourism
Side of tourism
Country participating in the grant
BG Hesse EL IT NL PL PT SK
Number of identified sources
Available external sources in total 2 2 2 2 13 11 5 2
Supply side - - - 1 2 1 2 1
Demand side 2 2 2 1 4 7 3 1
Both sides - - - - 7 3 - -
The data sources identified by WPJ partners have varying frequencies. Most of them
have a monthly frequency, but sources with weekly or even daily data have also been identified.
The latter constitute a very important element in the construction of the system supporting official
data in the field of tourism, as the resulting high-frequency time series will facilitate analysis
and allow changes to be captured more accurately.
On the basis of the data sources inventory, the project partners have developed Flow Models for their
countries according to the accepted scheme. Each model presents proposed directions for combining
data collected from external sources and web scraping with data from official statistics. An important
issue regarding the Flow Model is the block breakdown of data sources that can be used to compile
statistics on the demand and supply side of tourism. One important element of the models developed
by particular countries was to present areas and directions in which the results of combining data from
inventoried sources (using statistical methods) in the future will be used to improve official statistics.
Proposed areas and directions:
Improving the completeness of tourist accommodation base,
Spatial disaggregation of data on tourist accommodation base,
Flash estimates of the use of tourist accommodation base,
Improving the quality of data on trips,
Improving the quality of tourists expenses data,
Improvement of satellite accounts,
City tourism,
Event related tourism,
Tourism potential,
Tourism attractiveness.
The models developed by the project partners have been adapted to the number and types of
inventoried sources in each country and to the areas where they will be applied in the short term.
Their brief characteristics, taking into account the types of external data sources planned to be used
and the statistics potentially to be improved, are presented below, broken down by country.
To improve the quality of tourism supply-side data, the team from the Netherlands intends to use,
in addition to data from web scraping, information from the Register of non-categorized tourist
accommodation establishments, the Chamber of Commerce (via the Business Register), the Register of Addresses
and Buildings (BAG 2019), and others (Locatus, Airports register). Among the external data sources
that may be useful in estimating the demand side are, among others, the NBTC-NIPO Research Company
survey on holiday and recreation, and other sources (Terrace Research, business travel, museum and recreation
statistics). The planned results include: improving the completeness of the tourist accommodation base,
spatial disaggregation of data on the tourist accommodation base, improving the quality of tourist
expenditure data, and improvement of the tourism satellite accounts.
In Poland, the data obtained using the web scraping method of web portals offering accommodation
are used as a basis for improving the quality of data on tourist accommodation. To improve quality,
additional information will be used from the Register of non-categorized tourist accommodation
establishments (EOT), which is run and updated in Poland by commune offices. Information obtained from web scraping, Border Guard data, the Automatic Number Plate Recognition System (ANPRS) and traffic sensors is used to improve the quality of estimates of tourism demand (surveys of the participation of Polish residents in trips).
An important element in estimating the demand side of tourism can be data held by smart cities, which focus, among other things, on the development of “smart” tourism using the latest solutions and information technologies in conjunction with big data (Gretzel et al. 2016; Wang et al. 2016). Polish partners continue to cooperate with the city of Rzeszów on the use of data from the smart systems in its possession. Among the identified data sources is data on car traffic. Traffic volume is measured by monitoring equipment located at the largest intersections of the city. Example data concerning car traffic at intersections in the
city of Rzeszów are presented in Table 5.
Table 5. Data from car intersections from Smart City Rzeszów system
Data on the number of cars entering and leaving the city, combined with data obtained from license
plates (from which the municipality/country of origin of the vehicle owner may be identified), can be
used to estimate the number and country of origin of tourists as well as same-day visitors. This is of
particular importance for cities that are close to the border, as well as for those that are particularly
popular with foreign visitors.
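A minimal sketch of this kind of estimation (illustrative Python; the record layout, plate formats and country codes below are invented, and real ANPRS data differs):

```python
from collections import Counter
from datetime import datetime as dt

# Invented entry/exit records: (plate, country of registration, entry, exit).
records = [
    ("RZ 123AB", "PL", dt(2020, 7, 1, 9),  dt(2020, 7, 1, 18)),
    ("AA 456CD", "UA", dt(2020, 7, 1, 8),  dt(2020, 7, 1, 20)),
    ("BL 789EF", "SK", dt(2020, 7, 1, 10), dt(2020, 7, 3, 12)),
]

# Same-day visitors entered and left on the same calendar day;
# counting them by country of registration gives the origin breakdown.
same_day_by_country = Counter(
    country for _, country, t_in, t_out in records
    if t_in.date() == t_out.date()
)
print(same_day_by_country)
```

The same grouping applied to multi-day stays would yield a rough count of overnight visitors by origin.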
Furthermore, data from city parking meters can also be used to estimate the number of same-day tourists. Parking management systems in cities keep registers containing car registration data and the hours during which a given car occupied a parking space in a given part of the city. A good example is the city of Rzeszów, which in 2021 will become the first Polish city to introduce video monitoring of free parking spaces based on MESH 5G technology. The deployed cameras, sensors and neural networks will collect and analyse information about parked cars 24 hours a day. Knowledge of the number of daily arrivals of same-day visitors from abroad and of where their cars were parked in the city may be of particular importance when estimating the expenditure of foreigners on shopping in Poland. Thanks to the above-mentioned system, the number of vehicles parked near shopping centres can be determined.
Other interesting data from Smart City systems that can be used for tourism statistics include data on waste production and water consumption. According to Zorpus et al. (2014), one hotel tourist produces on average 1 kg of waste per day, which makes waste an interesting variable for estimating the number of foreign visitors to a city. Based on data on the level of waste production in cities (daily, weekly), it will be possible to estimate the level of tourist traffic (this especially applies to cities with highly seasonal tourist traffic). Information gathered from Smart City systems also enables the estimation of the number of participants in events such as outdoor concerts and sports events, by monitoring the number of buses stopping at stops and the people who use them. This system is currently being tested in Rzeszów. In the near future, ANPRS cameras will, among other things, allow automatic recognition of license plates, as well as monitoring of the stopping time, size and type of vehicle at the stop and the area occupied by the bus. Data from local servers will then be filtered in the central system, which will generate a report on vehicles that have stopped at the stops. As the example of Rzeszów shows, great opportunities are provided by the system that counts people entering and leaving the main square of the city and collects data from traffic sensors located on the roads and promenades leading to the market square, where various types of events take place. Motion sensors located on the promenades leading to the city square record all people entering and leaving the market.
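The waste-based estimate amounts to simple arithmetic: the 1 kg/day figure comes from the citation above, while the waste volumes in this sketch are invented for illustration.

```python
WASTE_PER_TOURIST_KG_PER_DAY = 1.0  # average hotel tourist (Zorpus et al. 2014)

def estimate_tourist_days(daily_waste_kg: float, baseline_kg: float) -> float:
    """Estimate tourist presence from the excess of daily municipal waste
    over an off-season baseline."""
    return max(0.0, daily_waste_kg - baseline_kg) / WASTE_PER_TOURIST_KG_PER_DAY

# A city producing 54 t of waste on a summer day against a 50 t baseline:
print(estimate_tourist_days(54_000, 50_000))  # 4000.0 tourist-days
```

In practice the baseline itself would have to be estimated from off-season data, and the per-tourist coefficient calibrated locally.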
The use of data from “smart” sources opens up new opportunities for tourism statistics. Cooperation between the city authorities operating the above-mentioned systems and the statistical offices may bring mutual benefits, contribute to changing the concept of tourist traffic management and to inventing innovative ways of managing it, and thus ultimately lead to an increase in tourist traffic in the region. More accurate statistical data showing the level of tourist traffic in cities, produced by combining data from official statistics with data from smart systems, will feed into decisions made by policy-makers regarding regional policy.
As a result of using statistical methods, i.e. combining and disaggregating data from these sources, it is
planned to achieve several goals. These include: improving the completeness of the tourist
accommodation establishments base, improving the quality of data on tourists' trips and expenses,
event tourism, developing flash estimates of the use of tourist accommodation establishments.
The effects of the described work and the results obtained may be used in the future for improvement
of Tourism Satellite Accounts (TSA).
In Portugal, estimates of the supply side of tourism will be compiled using data obtained from web scraping together with data from the National Tourism Registration - Tourist Establishments and the National Tourism Registration - Local Accommodation. Data from tourism demand side surveys will in the future be supported by data from web scraping and from administrative sources - airport data and credit and debit card transactions. By combining data from these sources, six results are expected, including improvement of the quality of tourists’ expenses data, improvement of the quality of data on trips, flash estimates for tourist accommodation and improvement of the Tourism Satellite Accounts (TSA).
In Italy, the improvement of the quality of the supply side data could be achieved by using web
scraping related data and data from administrative sources such as Water consumption and Waste
production. The estimation of tourism demand side could be improved on the basis of web scraped
data and using Survey on museums and similar institutions data, Water consumption data, Railway,
airport and port data and Border Traffic survey (BI) data. Exploiting the potential of the data
combination, ISTAT will aim to develop results that can contribute to the achievement of the
10 objectives. These include, among others, the following: improving the quality of data on trips,
improving the quality of tourist expenses data, improving the completeness of tourist accommodation
base, tourism potential and others.
Bulgaria has developed a Flow Model in which data from the Ministry of Interior, Airports and others
will be used to improve the quality of tourism supply side data. In the case of the demand side of
tourism, apart from the data obtained using the web scraping method, Bulgaria intends to obtain
additional information from the same sources as for the supply side. Based on data from the above-
mentioned sources, Bulgaria intends to obtain results for improving the completeness of tourist
accommodation base, flash estimates of the use of tourist accommodation base, improving the quality
of data on trips, improvement of satellite accounts.
In the Flow Model developed by Slovakia, the basis for improving the quality of tourism supply data,
in addition to data from web scraping, are data from Regulatory reporting FIN1-12 and Monthly bed
tax data including the overnights data in the cities Bratislava and Košice. The tourism demand side in
Slovakia will be additionally estimated based on data from web scraping and from an external source
- financial transaction data. Using the WPJ methods of combining data, Slovakia will attempt to improve the quality of data on tourism trips and to improve the completeness and validation of data on tourist accommodation establishments. Improving the quality of data in this area will allow the improvement of the Tourism Satellite Accounts (TSA) in the future. In addition, BTB data (unique Wi-Fi connections, etc.) can be used to determine tourism potential, tourism in selected cities or regions, and tourism centre points (attractions).
Germany (Hesse) improves the data quality of the supply side of tourism by using several data sources gathered by web scraping and linked to official statistics (accommodation survey and business register), in order to improve the completeness of official tourism statistics as well as the plausibility checks and imputation of survey data using data stemming from web scraping. Additionally, Hesse established a cooperation to use anonymous mobile network data to measure tourism flows not covered by official statistics (day visits, overnight stays in small accommodation businesses). This will contribute to improving the picture of the demand side of tourism. Other external data sources include the number of visitors and travellers by plane and ship, which also contribute to the demand side of tourism. Should data from financial transactions regarding the “sharing economy” become available through cooperation with the data holder (see workpackage G), they could be used to complete the picture of the demand side of tourism. Additionally, there exist some surveys strongly related to tourism in general that may cover the expenses of tourists in detail; however, they either have a very strong regional focus or do not contain information on expenses.
Greece has developed a Flow Model which focuses on improving the quality of tourism supply side
data. Estimates may be developed using data from web scraping and external data sources such as
the Tax Authority Register of Short-Term Lease Properties and Register of properties in Greece, which
are offered for short-term lease through digital platforms. By combining such data, Greece intends to
achieve the following goals: improving the completeness of tourist accommodation survey frame,
compiling flash estimates of tourist accommodation statistics.
1.2. Web scraping
Collecting data from websites related to tourism activities is one of the key ways to improve the quality
of data related to accommodation establishments as well as information on trips and expenditure.
Deliverable J1- ESSnet Methods for webscraping, data processing and analyses presents the Flow
Model (see Figure 1) for web scraping processes in the field of tourism.
It consists of three stages:
Stage 1: Methods for web scraping to gather the INPUT data, i.e.
identification of websites,
analysis of the catalogue of websites,
analysis of legal aspects.
Note that the collected data are not structured and not suitable for immediate analysis.
Stage 2: Data processing steps to PROCESS data, i.e.
implementation and saving of extracted data in a temporary database,
16
cleaning and standardization,
loading the transformed data into the data warehouse.
Stage 3: Data analysis, final stage to generate OUTPUT data:
analysis of the collected data.
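The three stages can be sketched as a minimal pipeline (illustrative Python; the page format and function names are invented, and the project's actual scrapers are described in deliverable J1):

```python
# Illustrative three-stage flow: gather INPUT, PROCESS, generate OUTPUT.

def gather_input(pages):
    """Stage 1: gather raw, unstructured INPUT data (here: canned text snippets)."""
    return [p for p in pages if "offer" in p]  # keep only relevant pages

def process(raw_pages):
    """Stage 2: extract, clean and standardize records (temporary store)."""
    records = []
    for page in raw_pages:
        # Naive extraction of an invented "offer:<name>;price:<eur>" format.
        fields = dict(part.split(":") for part in page.split(";"))
        records.append({"name": fields["offer"].strip().title(),
                        "price_eur": float(fields["price"])})
    return records

def analyse(records):
    """Stage 3: generate OUTPUT data, e.g. an average nightly price."""
    return sum(r["price_eur"] for r in records) / len(records)

pages = ["offer: hotel alfa ;price: 80", "offer: pension beta ;price: 60", "advert only"]
records = process(gather_input(pages))
print(analyse(records))  # 70.0
```

Real implementations replace the canned pages with HTTP requests and HTML parsing, and the temporary list with a staging database feeding a data warehouse.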
Figure 1. Diagram of the web scraping process
During the project, work was carried out on identifying tourist portals that will provide the most
complete information on accommodation, transport or food, among other things. Based on the data
from https://www.similarweb.com (updated 01.08.2020), an analysis of the most popular websites
for the countries participating in the project was carried out. The results are summarized in Table 6.
Table 6. Most popular websites for accommodation and hotels
Portal Worldwide BG Hesse EL IT NL PL PT SK
Booking.com 1 1 1 1 1 1 1 1 1
Airbnb.com 2 3 2 2 2 2 2 2 2
Vrbo.com 3 4
Hotels.com 4
Trivago.com 5 3 3 3
Pochivka.bg 2
Visit.bg 4
Hometogo.de 3
Fewo-direkt.de 4
Traum-ferienwohnungen.de 5
Ekdromi.gr 4
Casevacanza.it 4
Bed-and-breakfast.it 5
17
Table 7. Most popular websites for accommodation and hotels (cont.)
Portal Worldwide BG Hesse EL IT NL PL PT SK
Landal.nl 3
Roompot.nl 4
Centerparcs.nl 5
Noclegi.pl 3
Profitroom.com 4
Noclegowo.pl 5
Megaubytovanie.sk 3
Ubytovanienaslovensku.eu 4
Sorger.sk 5
Based on the results obtained, the most popular websites in Europe are:
Booking.com
Airbnb.com
Trivago.com
The remaining places in the ranking are local websites offering services only within a given country.
However, it should be noted that the popularity ranking of websites is based on the number of visits to a website and does not specify what was searched for. For websites that are available in most European countries, searches for accommodation establishments may refer to those located in a given country as well as abroad. Therefore, in the next stage, the number of offers provided on international websites was checked, by way of example, and compared with the number of offers on selected national websites. The results are presented in Table 8.
Table 8. Number of establishments in selected portals
Portal BG Hesse EL IT NL PL PT SK
Booking.com 4 718 3 944 5 871 22 032 13 490 8 130 5 205 2 566
Hotels.com 2 076 4 850 20 122 2 314 6 025 6 393 11 430 653
Nocowanie.pl 36 710 47 949 4 881
Pochivka.bg 17 226
traum-ferienwohnungen.de 702
Ekdromi.gr 401
Landal.nl 641
noclegi.pl 34 839
Megaubytovanie.sk 4 827
On this basis, it is not possible to clearly indicate the best and most complete website containing
accommodation establishments that could be subject to the web scraping process and constitute the
only source of information for all countries.
Attention should also be paid to the limitations of search engines on individual websites. Very often,
portals, especially international ones, limit the number of results presented at one time. The
Booking.com portal limits the number of results displayed for a given location to a maximum of 1 000,
the Airbnb.com portal presents a maximum of 300 offers for a single location, while most domestic
portals present all offers at the same time. Therefore, the preparation of web scraping processes for
international portals requires the development of more comprehensive solutions.
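One common workaround for such result caps, sketched below under stated assumptions (the cap values come from the text; the query function and region encoding are invented stand-ins for a portal's geographic search), is to split a query region recursively until every sub-query returns fewer results than the cap:

```python
CAP = 1000  # e.g. Booking.com shows at most 1 000 results per location query

def count_offers(region):
    """Stand-in for a portal query returning the number of offers in a region.
    Here a region is a (lo, hi) interval with one offer per unit."""
    lo, hi = region
    return hi - lo

def collect(region, cap=CAP):
    """Return sub-regions small enough to be scraped completely."""
    if count_offers(region) <= cap:
        return [region]
    lo, hi = region
    mid = (lo + hi) // 2
    return collect((lo, mid), cap) + collect((mid, hi), cap)

subregions = collect((0, 3500))
print(len(subregions))  # 4 sub-queries, each under the cap
```

With a real portal, the split would be geographic (sub-districts or bounding boxes) rather than a numeric interval, but the recursive principle is the same.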
In summary, the best approach for countries is to retrieve data from both international and local
portals, and then subject the created database to cleaning and standardization processes. All the steps
necessary for this purpose and the description of the remaining stages in the model presented in
Figure 1 are described in detail in the above-mentioned deliverable J1.
1.3. Source characteristics
The objectives defined for WPJ - Innovative Tourism Statistics state that “The increasing growth of information leads to big data and information systems targeting to administer, analyse, aggregate and visualise these data”. Variable or data issues, such as overlapping concepts, indirect relations between sources and redundancies or inconsistencies of information, although not always obvious or immediately recognizable, are inevitable when combining big data sources with administrative sources and official statistics.
Big data as well as administrative data are not produced for the purpose of official statistics and,
therefore, before integrating them in the production system of official statistics, potential quality,
methodological, legal, privacy and other issues should be identified and addressed; when multiple sources are to be combined, their indirect interconnections should also be taken into account.
As the information expands horizontally (more domains or areas) and vertically (more information for
each domain or area), it will be increasingly difficult to manage, search, extract implicit information
and check for inconsistencies. Therefore, it is strategically imperative to prepare and present the
relevant information in a user-friendly interface capable of dealing with increasing complexity.
This subchapter will address source and variable mapping concerning official statistics variables and
domains. The goal is to provide an interactive graphical representation of the relationships and
interconnections between data sources (multi-purpose data sources, survey data and web scraped
data), variables and domains, countries and experimental results.
This approach is inspired by network analysis, which can be considered a map of relationships. These relationships are composed of nodes (for example, statistical domains or data sources) and edges that represent the connections between nodes. In this map of relationships, not only different types of nodes but also different types or intensities of connections (edges) can be represented. For example, if a particular source is very reliable or its data is highly relevant to a particular domain, its connection can be represented graphically by a thicker line.
Source and variable identification and taxonomy are part of a bottom-up process to produce a basis for an ontology. As data sources and variables are identified and described, it is possible to organize them in categories with an underlying structure of connections. The end result provides domain knowledge within a shared framework and contributes to reducing ambiguities and misinterpretations.
R Packages for network visualisation
Dedicated commercial and open source2 software for ontology development is widely available. Nevertheless, it is possible to cover the basic principles with an already familiar language such as R. This language offers some advantages, as it is already a common tool in NSIs. As such, it will probably be faster not only to produce and disseminate results and to share data and code, but also to integrate them in existing workflows. Avoiding the need to learn a new tool is also an important consideration.
Network visualization is a valuable alternative for representing a complex and otherwise static set of heavily interconnected data. For this purpose, the visNetwork package3, an R interface to the “vis.js” JavaScript library, was chosen for interactive visualization of networks consisting of nodes and edges. Although it is not mandatory to know JavaScript to use this package, some basic knowledge is helpful for some functionality, such as action-related events. The visNetwork package is very flexible and accessible, not only because it is based on open-source software, but also because it works in any modern browser for up to a few thousand nodes and edges.
The package has highly customizable options such as shapes, styles, colours, sizes and images but, most importantly, interactive controls (highlighting, collapsed nodes, selection, zoom, physics, movement of nodes, tooltips and events). Additionally, it uses the HTML canvas for rendering.
It is based on htmlwidgets, so it is compatible with Shiny, R Markdown documents and the RStudio viewer. R’s ability to read and write a multitude of formats makes it very appealing for this sort of task. If it becomes necessary to adopt a new tool in future work, all the work produced could very likely be reused.
Network visualisation for WPJ
In order to create an interactive network, the graphic WPJ Flow Models from Bulgaria, Germany-Hesse, Greece, Italy, the Netherlands, Poland, Portugal and Slovakia had to be “translated” into two input data objects describing nodes and their connections (edges). This can be done using a tabular format like CSV, XLSX or even R data frames.
2 https://protege.stanford.edu/
3 https://datastorm-open.github.io/visNetwork/
Figure 2. WPJ Flow Models “translated” in two input data objects describing nodes and their connections (edges)
This first step, illustrated in the figure above, requires significant re-conceptualization work in order to make the different workflows compatible and consistent without tampering with their substance. The majority of the countries used a similar template, but its implementations and content varied, which made making them compatible quite challenging.
The structure of the nodes tabular object is very simple. It only needs 8 fields:
Id: a unique numeric identification value.
Label: a string value describing the ID.
Title: a string value that will be displayed when the node is selected or when the mouse pointer hovers over it.
Group: a string value used as input to a combo-box that allows selecting which layer of information will be displayed. It can have multiple categories separated by commas.
Value: a numeric value that defines the size of the object. It varies according to the number of countries that used it.
Shape: a string value that identifies the nature of the node.
Colour: a string value that identifies the nature of the node.
url: a string value, a web link that will be active and allow the user to open it in a browser upon a click.
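A sketch of how such a nodes table might be filled in and written to CSV (illustrative Python with invented example nodes; the project's actual tables are maintained alongside the R script):

```python
import csv, io

# Two invented nodes following the 8-field structure described above.
nodes = [
    {"id": 1, "label": "Web scraping", "title": "Data source: web scraped data",
     "group": "Data Source: Web Scraped Data", "value": 8,
     "shape": "square", "colour": "orange", "url": ""},
    {"id": 2, "label": "Trips", "title": "Domain: trips",
     "group": "Domain", "value": 5,
     "shape": "dot", "colour": "lightblue", "url": ""},
]

# Serialize to CSV, one of the tabular formats mentioned above.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(nodes[0]))
writer.writeheader()
writer.writerows(nodes)
header = buf.getvalue().splitlines()[0]
print(header)  # id,label,title,group,value,shape,colour,url
```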
Figure 3 presents a detail of the nodes tabular object. As mentioned, nodes can have multiple shapes, colours and sizes, which can be used as attributes. Multiple-labelled attributes can also be used in the “group” column, so each node can have multiple attributes (or categories). This enables the user to select, through a combo-box, different layers of information to be displayed. Edges can also have different lengths, widths, arrows and formats.
Figure 3. A detail of the nodes tabular object
The structure of the edges tabular object is also very simple:
From: a numeric value, the id of the node where the connection starts.
from_label: a string value, the corresponding label of the id. It makes identification easier.
to: a numeric value, the id of the node where the connection ends.
to_label: a string value, the corresponding label of the id. It makes identification easier.
Length: a numeric value defining the length of the connection (edge).
Width: a numeric value defining the width of the connection (edge).
Label: a string value, useful for commenting on a particular connection (edge).
Arrows: a string value defining the direction of the connection.
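A small sketch of the edges structure, with a consistency check one would typically run before visualization: every edge endpoint must refer to an existing node id. All ids and labels below are invented for illustration.

```python
node_ids = {1, 2, 3}  # ids taken from the nodes object

edges = [
    {"from": 1, "from_label": "Web scraping", "to": 2, "to_label": "Trips",
     "length": 200, "width": 2, "label": "", "arrows": "to"},
    {"from": 1, "from_label": "Web scraping", "to": 3, "to_label": "Expenses",
     "length": 200, "width": 1, "label": "", "arrows": "to"},
]

# Edges pointing at a non-existent node would render as dangling artefacts.
dangling = [e for e in edges if e["from"] not in node_ids or e["to"] not in node_ids]
print(len(dangling))  # 0
```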
Figure 4. The structure of the edges tabular
Using input data (nodes and edges) in a tabular format has advantages when it comes to filling in the data, but it has some evident disadvantages when dealing with import/export formats and managing file locations, to name a few. Therefore, the best option available is to integrate the data in the R script itself4. This way, both the data and the code needed to produce the network visualization are available in a single text file. This also has obvious advantages when it comes to sharing and disseminating the results.
The script is then “self-contained” in the sense that the data is embedded in the code. For this purpose, the dput {base} R command can be used, which conveniently recreates an object such as a data frame. It does so by writing an ASCII text representation of an R object to the RStudio console, which in turn can be copied into the script itself. This way the user is not required to import data and just needs to run
4 The complete script can be found in the wiki page of the project or be requested to rui.alves@ine.pt. An image detail can be found in Annex 6.
the code. The script is also heavily commented and organized in an outline layout. Comments include an introduction to the script, what is new about that particular version of the script and what is new about the input data (e.g. corrections, additions). In some cases there are even short descriptions of the available options. This makes it very easy for the user to understand each single step of the script.
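The dput pattern has a straightforward analogue in other languages. As an illustration only (the project's script is in R), Python's repr() of a simple structure can likewise be pasted back into a script to make it self-contained:

```python
# A small data object, of the kind one would otherwise import from a file.
nodes = [{"id": 1, "label": "Web scraping"}, {"id": 2, "label": "Trips"}]

embedded = repr(nodes)      # text representation, ready to paste into a script
recreated = eval(embedded)  # recreates an equal object from that text

print(recreated == nodes)  # True: data and code can live in one file
```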
Figure 5 presents a detail of the R script in the RStudio IDE.
Figure 5. A detail of the R script in RStudio IDE
RStudio’s “Show document outline” feature (Ctrl+Shift+O), which can be found in the top right corner of the image above, shows the complete outline of the script and makes it easier to browse. The outline, the comments and the use of the pipe operator (%>%) are focused on usability, so the user can feel encouraged to experiment with new options.
The script creates the data and a visNetwork object, shows the result in the RStudio viewer and saves it as an HTML file. This can be useful for non-R users or simply for sharing. Nodes with more connections (edges) are automatically placed in the centre, while nodes with fewer connections are relegated to the periphery of the network, thus giving centrality a sense of relevance. Overlapping of nodes is also automatically minimized. As mentioned earlier, the size of the nodes is proportional to the number of countries they relate to. Figure 6 presents a general (static) view of the visNetwork “Overview of the Inputs and Outputs of the Pilot Project on Innovative Tourism Statistics”.
Figure 6. Overview of the Inputs and Outputs of the Pilot Project on Innovative Tourism Statistics
Interactivity is a key feature of this visualization. Green navigation buttons on the bottom left and bottom right of the network canvas can be used to navigate, zoom in and out and re-centre. A multi-selection feature is also available: a long click as well as a control-click will add to the selection.
Figure 7. Navigation buttons
When hovering over a selected node, a tooltip presents additional information contained in the nodes object, more precisely in the “title” field. When a node is in fact a URL (in the case of this visNetwork, it has the shape of a light-blue coloured triangle), the web link is shown; it can be accessed with a double-click, which opens a new window with the related content.
Figure 8. Additional information contained in the nodes object
At any moment, the user can create a screenshot of the network by using the “Export as PNG” button. The resulting image will be saved to the “Downloads” folder. This can be useful for extracting images of the different layers of insights that the complete visNetwork inherently holds.
Figure 9. “Export as PNG” button
Other important elements of interactivity are the two combo-boxes that allow filtering the data. One presents a list of every node existing in the network, and the other presents a unique list of categories or attributes from the group column of the nodes input data. Each node can have multiple attributes, hence this combo-box can produce a wide variety of outcomes. Figure 10 presents how selecting “Data Source: Survey Data” from the combo-box filters out all the elements that do not fall into that category. This allows the user to, for example, produce images of separate networks for each individual country represented.
Figure 10. Selection of “Data Source: Survey Data” from the combo-box
Finally, the “click and drag” functionality allows the user to re-arrange the nodes in the available space and to intuitively understand the weight each node has in the whole network. An important node will drag along a significant number of others, while an unimportant one will barely have any impact. The elastic properties of the edges make this “click and drag” functionality very appealing for exploring the network’s interconnections.
By clicking a node, more information is presented as a tool tip.
Figure 11. A node with more information
Dragging a node gives a sense of the importance it holds in the global network, since it will literally pull all other directly or indirectly connected nodes along with it. This inertia is proportional to the node’s relative importance to the whole network.
Figure 12. A dragged node with more information
The use of URLs, as shown in Figure 13, represented as light-blue triangle nodes, is a good way to link and expand the visNetwork to other content. In the present case, the URLs only link to existing webpages. Nevertheless, if the visNetwork (in HTML format) is stored, for example, on a shared network drive, it is possible to link to documents in other formats such as PDF, MS Word and MS Excel, among others. This feature significantly expands the limits of depth and format of data that a visNetwork is natively able to support.
Figure 13. The use of URL links to existing webpages
This visualization tool was also found suitable for variable mapping, as described in use case WPJ.2.PT. Variable mapping adds a new layer of information to the previously developed visNetwork “Overview of the Inputs and Outputs of the Pilot Project on Innovative Tourism Statistics (v5)”. Based on the existing R script, this new version with extended content allows clearer and more intuitive views of how variables are used in the workflow that mediates between data sources and experimental results.
Describing an interactive tool in a static format such as a written report (as in this case) is challenging and will always fall short. Therefore, it is highly recommended to run the R script and use the HTML version of this visNetwork in order to fully benefit from the experience of such a dynamic product. A small video on how to use and make a visNetwork is available at https://youtu.be/cYETq0rIT9k as an extension of this work.
A quality evaluation framework for administrative- and web-data
The data structures found while scraping web data from the internet show some similarities with administrative data sources, e.g. accommodation identifier, name of the accommodation and type of accommodation. The main difference seems to be that administrative data sources have formal taxonomies, while web data do not. In fact, users of online platforms like Hotels.com or Booking.com have a number of degrees of freedom in naming and classifying their accommodations. They may also leave some fields empty and can choose or define their own accommodation category to advertise on the web.
In 2008, a framework for the evaluation of the quality of administrative and other secondary data sources was introduced by Daas et al. (2009). This framework puts forward a successive evaluation of quality aspects that includes three hyper-dimensions: Source, Metadata and Data. The relevance of this approach is that it highlights different quality aspects (dimensions) of a data source, where each dimension may contain several quality indicators.
A quality indicator is measured (or estimated) by one or more qualitative or quantitative methods. An important result of this sequential process is that it efficiently guides the user in the evaluation of the quality of sources and (meta)data. Moreover, a hierarchy is introduced among the hyper-dimensions. It prevents investing time and effort in the determination of quality aspects that may not (yet) be relevant at, for example, an exploratory phase. Last, by evaluating each hyper-dimension, informed decisions (and actions) on data sources can be taken rapidly.
In this section, the checklist proposed by Daas et al. (2008b; 2009) was applied to an example of multiple data sources. Basically, a comparison between the hyper-dimensions of an administrative data source (Register of Addresses and Buildings, BAG) and a web source (Hotels.com) is displayed in Table 9 and Table 11. These tables focus on the Source and Metadata hyper-dimensions. Note that the third hyper-dimension (Data) is kept outside the scope of the current Pilot Track WPJ project.
The Source hyper-dimension deals with the quality aspects related to the data source as a whole, the
data source keeper and the delivery of the data source to the NSI. The Source hyper-dimension is
composed of five quality dimensions: Supplier, Relevance, Privacy and Security, Delivery, and
Procedures. Table 9 lists the dimensions, quality indicators and measurement methods for the Source
hyper-dimension.
The Metadata hyper-dimension focuses on the metadata inherent to aspects of the data source.
Clarity of the definitions and completeness of the meta information are some of the quality aspects
included. The Metadata hyper-dimension is composed of four dimensions: Clarity, Comparability,
Unique keys, and Data treatment (by the data source keeper). The latter is a special case as it consists
of quality indicators used to determine whether the data source keeper performs any checks on
and/or modifies the data in the source. See Table 11.
Table 9. Dimensions, quality indicators, and methods for SOURCE (based on Daas et al., 2009)

| DIMENSIONS | QUALITY INDICATORS | METHODS | Admin data | Web data |
|---|---|---|---|---|
| 1. Supplier | 1.1 Contact | Name of the data source | Register of Addresses and Buildings (BAG) | Hotels.com |
| | | Data source contact information | Cadastral Agency | Expedia Group |
| | | Contact person at NSI | WRGR | SOTA |
| | 1.2 Purpose | Reason for use of the data source | Improve tourist accommodation base | Improve tourist accommodation base |
| 2. Relevance | 2.1 Usefulness | Importance of data source for NSI | Includes accommodations indicator and GPS data | Includes accommodation listings and GPS data |
| | 2.3 Information demand | Does the data source satisfy the information demand? | Yes | Yes |
| | 2.4 Response burden | Effect of data source use on response burden | Unknown; longer time series are needed to determine the effect | Unknown; longer time series are needed to determine the effect |
| 3. Privacy and security | 3.1 Legal provision | Basis for existence of the data source | Formal cooperation | Allowed for research and development |
| | 3.2 Confidentiality | Does the Personal Data Protection Act (GDPR) apply? | Yes | Yes |
| | | Has use of the data source been reported by the NSI? | Yes | Yes (via user agent of the scraper) |
| | 3.3 Compatibility | Can the data source be directly used by the NSI? Are there special requirements to be met before data usage (e.g. use of synthetic identifiers)? | Yes | Yes |
| | 3.4 Security | Manner in which the data source is sent to the NSI | Structural deliveries via a secured channel | Public website |
| 4. Delivery | 4.1 Costs | Costs of using the data source | Handling costs of open data | Retrieval and handling costs of public data |
| | 4.2 Arrangements | Are the terms of delivery documented? | Flowchart | n.a. |
| | | Frequency of deliveries | Monthly | Daily |
Table 10. Dimensions, quality indicators, and methods for SOURCE (based on Daas et al., 2009) (cont.)

| DIMENSIONS | QUALITY INDICATORS | METHODS | Admin data | Web data |
|---|---|---|---|---|
| 4. Delivery | 4.3 Punctuality | How punctually can the data source be delivered? | Within a week after the end of the month | n.a. |
| | | Rate at which exceptions are reported | Monthly | n.a. |
| | 4.4 Selection | What data are delivered? | All cities; all dwellings | n.a. |
| | | Do these comply with the requirements of the NSI? | Yes, but the data lack a second-home indicator | Yes |
| 5. Procedures | 5.1 Data collection | Familiarity with the way the data are collected | Yes; structural (since 2012) and under an alliance agreement (since 2019) | Empirical |
| | 5.2 Planned changes | Familiarity with planned changes of the data source | Yes | No |
| | 5.3 Feedback | Ways to communicate changes to the NSI | By mail/telephone | No |
| | | Contact the data source supplier in case of trouble? | Yes | No |
| | | In which cases and why? | Coverage or update problems | n.a. |
| | 5.4 Fall-back scenario (risk estimation) | Dependency risk of the NSI | High; at least two other statistics use the same source | Medium; other data sources are available |
| | | Emergency measures when the data source is not delivered according to agreement | Open data are available but editing is required | n.a. |
Table 11. Dimensions, quality indicators, and methods for METADATA (based on Daas et al., 2009)

1. Clarity (clarity score of the definition: 0 = missing, 1 = ambiguous, 2 = clear)

| QUALITY INDICATORS | Admin data | Score | Web data | Score |
|---|---|---|---|---|
| 1.1 Population unit definition | Residence object (house or building) | 2 | Tourist accommodation (hotel, holiday home or similar) | 2 |
| 1.2 Classification variables | Purpose of use | 2 | City | 2 |
| | Type of tourist accommodation | 1 | Location | 2 |
| 1.3 (Key) count variables | Construction year | 2 | Price | 2 |
| | Status of dwelling | 2 | Guest reviews | 1 |
| | Surface | 2 | Capacity | 0 |
| 1.4 Time dimensions | Monthly (update depends on city) | 2 | Daily (update depends on accommodation) | 2 |
| 1.5 Definition changes (familiarity with occurred changes) | Yes (included in metadata) | 2 | No (self-check) | 0 |

1.5 Decisions and actions: when one or more of the above quality indicators are scored 'description unclear' (score 1) or 'description missing' (score 0), the data source keeper needs to be contacted. Only when these issues are solved may the evaluation continue; otherwise the evaluation stops here. Remark: the keeper of the web data source needs to be contacted (the unavailability of accommodation capacity is a constraint).

2. Comparability (comparability score: 0 = missing, 1 = unequal, conversion is impossible, 2 = unequal, conversion is possible, 3 = equal, 100% identical)

| QUALITY INDICATORS | Admin data | Score | Web data | Score |
|---|---|---|---|---|
| 2.1 Population unit definition comparison with the NSI definition | Comparable definitions | 2 | Comparable definition | 2 |
| 2.2 Classification variable comparison with the NSI definition | Purpose of use | 2 | City | 3 |
| | Type of tourist accommodation | 2 | Location | 2 |
| 2.3 Count variable comparison with the NSI definition | Construction year | 3 | Price | 2 |
| | Status of dwelling | 3 | Guest reviews | 1 |
| | Surface (built) | 2 | Capacity | 0 |
| 2.4 Time differences | Yes, per month | 3 | No | 2 |
Table 12. Dimensions, quality indicators, and methods for METADATA (based on Daas et al., 2009) (cont.)

2.4 Decisions and actions: when the data source is used to replace, or in addition to, other data sources and some of the comparability indicators have scored 'unequal and conversion is impossible' (score 1) or 'description missing' (score 0), the data source cannot be used and the evaluation stops here. These scores are less relevant for data sources that are used for new statistics; in the latter and all other cases, the evaluation may continue. Remark: the web data source can be partially used.

3. Unique keys (0 = keys missing, 1 = keys unequal, conversion is impossible, 2 = keys unequal, conversion is possible, 3 = keys equal, 100% identical)

| QUALITY INDICATORS | Admin data | Score | Web data | Score |
|---|---|---|---|---|
| 3.1 Identification keys: presence of unique keys | Residence object ID | 2 | HotelID | 2 |
| 3.1 Identification keys: comparability with unique keys used by the NSI | Accommodation number | 2 | Accommodation number | 2 |
| 3.2 Unique combinations of variables | Postcode, house number, house letter | 3 | Postal code, street | 2 |
| | Latitude, longitude (CRS: Amersfoort) | 2 | Latitude, longitude (CRS: 4326) | 2 |

3.3 Decisions and actions: data sources that need to be linked to other sources and were found not to contain unique keys or unique combinations of variables cannot be used; when this is the case, the evaluation should stop here. When the presence of unique keys or unique combinations of variables is not known for a data source, this should be investigated in more detail; contacting the data source keeper might be required to solve this problem. In all other cases the evaluation may continue. Remark: the web data source can be used.
Table 13. Dimensions, quality indicators, and methods for METADATA (based on Daas et al., 2009) (cont.)

4. Data treatment by the data source keeper (0 = do not know, 1 = no, 2 = yes)

| QUALITY INDICATORS | Admin data | Score | Web data | Score |
|---|---|---|---|---|
| 4.1 Checks: population unit checks performed | Yes | 2 | No (a foreign address was found) | 1 |
| 4.1 Checks: variable checks performed | Yes (city, purpose of use) | 2, 2 | Yes (city, street) | 2, 2 |
| 4.1 Checks: combinations of variables checked | Yes: postcode, street, house number, house letter | 2, 1, 1, 1 | Yes: postal code, street, house number, house letter | 1, 1, 1, 1 |
| 4.1 Checks: extreme value checks | Yes: surface, construction year | 2, 2 | Yes: address, geolocation | 2, 2 |
| 4.2 Modifications: familiarity with data modifications | Yes, editing is done | 2 | No | 1 |
| 4.2 Modifications: are modified values marked and how? | Yes, a status variable is provided (changes in names) | 2 | No | 1 |
| 4.2 Modifications: familiarity with default values used | Yes, original and corrected files are available | 2 | Yes (self-check) | 2 |

4.3 Decisions and actions: if for one or more of the above indicators 'do not know' (score 0) is answered, the data source keeper needs to be contacted to clarify these issues. Remark: the web data score low on checks and modifications.

5.1 Conclusion METADATA 1: Is every question for each indicator answered? 1 = no (describe which not and why); 2 = yes.
Table 14. Dimensions, quality indicators, and methods for METADATA (based on Daas et al., 2009) (cont.)

5.2 Conclusion METADATA 2: Do all the indicators in the Clarity, Comparability, and Unique keys dimensions have a score of 2 or higher, and in the Data treatment dimension a score of 1 or higher?
- 2 = yes: go to the Data part of the evaluation procedure (under development).
- 1 = no: is this a problem for the NSI? 1 = no, because the current use of the web data is for an accommodations inventory: go to the Data part of the evaluation procedure (under development). 2 = yes (reason to be stated): STOP EVALUATION.
Next steps
The ideal situation would be to evaluate and/or benchmark more data sources. However, due to time constraints, this task has not been completed. Nevertheless, it is worth mentioning that the work of Daas et al. (2019) contemplates this extension and also provides a recommendation on who should fill in this checklist:
Who should use this checklist?
The checklist should be filled in by an internal (future) user of the data source and/or an expert on the secondary data source. For the Source part it is advised to contact the NSI contact person for the particular data source (if available).
The operationalization of the next step means that Statistics Netherlands collects some additional experiences with web scraping other platforms (e.g. Booking.com and Airbnb) that are connected to the ESSnet Big Data WPJ. Table 15 and Table 16 display a qualitative comparison between these additional web data sources. The symbols denote the most commonly observed score: good (+), reasonable (o), poor (-) and unclear (?).
Table 15 and Table 16 show that, so far, the web data of Booking.com seem to be a potential new source for tourism statistics. See also the report of Spinder (2019).
Table 15. Evaluation results for the Source hyper-dimension

| DIMENSIONS | Hotels.com | Booking.com | Airbnb.com | Tripadvisor.com | Skyscanner.net | taxi2airport.com |
|---|---|---|---|---|---|---|
| 1. Supplier | + | + | + | ? | ? | ? |
| 2. Relevance | + | + | + | + | + | + |
| 3. Privacy and security | + | + | o | ? | ? | ? |
| 4. Delivery | o | o | o | ? | ? | ? |
| 5. Procedures | -/? | +/o | o | ? | ? | ? |
Table 16. Evaluation results for the Metadata hyper-dimension

| DIMENSIONS | Hotels.com | Booking.com | Airbnb.com | Tripadvisor.com | Skyscanner.net | taxi2airport.com |
|---|---|---|---|---|---|---|
| 1. Clarity | o/- | + | o/? | ? | ? | ? |
| 2. Comparability | o | + | - | ? | ? | ? |
| 3. Unique keys | o | o/? | o | ? | ? | ? |
| 4. Data treatment | + | o/- | o | ? | ? | ? |
2. Methods used for new tourism data
2.1. Methodology for combining and disaggregating data
Among the many inventoried data sources related to tourism, web scraping databases, administrative
registers and surveys for tourist accommodation establishments are of particular importance for the
process of combining and disaggregating data. The combination of web scraping databases and
administrative registers of tourist accommodation establishments will allow for the acquisition of new
accommodation establishments. Thanks to the high frequency of data from web scraping of
accommodation portals, it is possible to disaggregate monthly data on the occupancy of
accommodation establishments into daily data.
2.1.1. Combining data
The linkage of data from web scraping with data from administrative registers and statistical survey
frames is largely dependent on an appropriate matching strategy between these sources. Therefore,
achieving a reliable and as complete as possible result in terms of the population of tourist
accommodation establishments requires cross-checking the aggregates available on tourism-related
Internet portals and the aggregates from the administrative records and surveys of tourist
accommodation establishments.
When the data from the portals have been collected, the first question that arises is how many establishments occur both in the portal data and in the survey frame. This can be answered by data linkage (or data matching) of the data from the survey and the data from the accommodation portals.
Data linkage
Data linkage makes it possible to check whether a given tourist accommodation establishment from the portals is also present in the survey frame. There are two main approaches to this task: deterministic (or rule-based) record linkage and probabilistic (or fuzzy) record linkage. In the simplest deterministic matching, an identifier or a group of identifiers is compared across databases, and two records are linked when all of the identifiers agree. A single unique identifier, e.g. an ID from the business register, can serve as a matching variable. If it is not available, a group of identifiers can be used, e.g. postal code and address.
The scraped data contain several identifiers that can be used for the linkage:
• name of the accommodation establishment,
• address,
• postal code,
• longitude and latitude.
The linking process is, however, hampered by a number of difficulties. Postal codes and addresses from accommodation portals often contain typographical and data entry errors, such as transposed digits and misspellings. This may occur if the portal does not enforce a standard format when entering data (or uses a different scheme than the business register or the addresses from the survey frame). Sometimes the address is hidden in the name of the establishment while the address field is not filled in. On the other hand, the geographical coordinates represented by pins on a map are sometimes absent or not placed precisely near the building entrance or in the centre of the building. Additionally, some hotels may choose their gate to the parking lot as their GPS address.
As a result, there is no completely reliable identifier that can be used for the linkage. Thus, the data linkage cannot be conducted without data pre-processing and additional tools.
The consecutive subsections present two approaches to data linkage: the first is based on geographical coordinates and distances between accommodation establishments, the second on comparing text strings formed from the addresses and postal codes of accommodation establishments.
Distance-based linkage
In the first step, addresses and postal codes need to be converted to the same format. Many rules standardizing text strings may be implemented with so-called regular expressions; for more details see, e.g., Friedl J. (2006). Standardization is recommended for every type of data linking, as it enables analysing and using data in a consistent manner. Unfortunately, such standardization is difficult for addresses: the variety of street names and the variety of misspellings make it hard to formulate regular expressions that handle all possible cases.
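To illustrate, a few such rules can be written with regular expressions. This is a minimal sketch; the cleaning and abbreviation rules shown are hypothetical examples, not the rule set actually used by WPJ:

```python
import re

def standardize_address(addr: str) -> str:
    """Apply simple rule-based cleaning to a raw address string."""
    s = addr.lower().strip()
    s = re.sub(r"[.,;]", " ", s)        # replace punctuation with spaces
    s = re.sub(r"\bul\b", "ulica", s)   # example rule: expand a street abbreviation
    s = re.sub(r"\s+", " ", s).strip()  # collapse repeated whitespace
    return s

print(standardize_address("Ul. Mickiewicza  10"))  # 'ulica mickiewicza 10'
```

In practice such rules grow into a country-specific catalogue that is applied identically to the survey frame and the scraped data.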
As a result, it is difficult to perform deterministic linking based on addresses alone, but standardization is still very helpful in the next step of data linkage, i.e. deriving latitude and longitude from the (pre-processed or standardized) addresses. These coordinates can be obtained using any geocoding software. Some countries have the knowledge and tools needed to implement the geolocation process. For other countries, WPJ developed a tool for geolocating the addresses of accommodation establishments from the survey frame and from web scraping. The solution was prepared in JavaScript and uses the HERE Maps API. HERE Maps are precise and up to date, so they are often used for car navigation. The variety of functionalities available in the application, as well as its extensive documentation and technical support, made it possible to build a universal tool that is part of the prototype created by WPJ. A very important function used in the developed solution is the automatic parsing of address data into a common structure, which significantly reduces the time needed to prepare batch files for the tool. The more work is done on address standardization, the better the performance of the HERE Maps API. In case a given address is not recognized at all by the geolocation tool, latitude and longitude can be taken from the portal. To allow analysis and selection of the optimal tool, each partner, while working on the geolocation process and preparing case studies, has chosen the tool best suited to their situation.
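Batch geocoding reduces, per address, to one HTTP request. The sketch below only composes the request URL; the endpoint follows the public HERE Geocoding & Search v1 API as we understand it, and should be verified against the current HERE documentation before use (the API key is a placeholder):

```python
from urllib.parse import urlencode

# Assumed HERE Geocoding & Search v1 endpoint; check the current documentation.
HERE_GEOCODE_ENDPOINT = "https://geocode.search.hereapi.com/v1/geocode"

def build_geocode_url(address: str, api_key: str) -> str:
    """Compose a geocoding request URL for a free-text address."""
    return HERE_GEOCODE_ENDPOINT + "?" + urlencode({"q": address, "apiKey": api_key})

print(build_geocode_url("Plac Zamkowy 4, Warszawa", "YOUR_API_KEY"))
```

The response (not shown) contains, among other fields, the latitude and longitude of the best-matching address.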
If the geographical coordinates are available for the addresses from the survey as well as from the accommodation portal, the distance-based approach can be applied. In this approach there is no need for the establishments from different data sources to have exactly the same coordinates. Nevertheless, the coordinates in all data sources must come from the same coordinate reference system (CRS), which uniquely defines the positions of points in space. It is common to use WGS 84, which is also used by the Global Positioning System (GPS). Prominent web map services often use this CRS as well, and so do many internet portals that rely on these services to show points on maps.
The distance-based approach can be applied in the following way:
1. Calculate the distances between all establishments from the survey frame and all establishments from the scraped data. This can be done with, e.g., the Haversine formula or Vincenty's formulae5.
2. For each establishment in the scraped data, find all establishments in the survey frame for which the distance between them does not exceed a threshold.
3. For each establishment in the scraped data, match it with the closest establishment found in the second step.
In the second step, various thresholds can be chosen with respect to, e.g., accommodation type groups. For instance, hotels are in general larger than apartments or holiday homes; moreover, hotels sometimes have more than one entrance, which results in two different sets of coordinates.
In a deterministic approach, establishments are paired if the distance between their addresses or geolocations does not exceed a critical value. Setting this threshold plays a crucial role in this method: a threshold that is too low (i.e. only a small distance is allowed) will result in too many False Negative pairs, while allowing for larger distances may introduce too many False Positive pairs.
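The distance-based matching described above (steps 1-3 plus a threshold) can be sketched as follows. The establishment IDs and coordinates are made up for illustration; WGS 84 coordinates are assumed:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two WGS 84 points (Haversine formula)."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def match_closest(scraped, frame, threshold_km):
    """For each scraped establishment, return the closest survey-frame
    establishment within threshold_km, or None if no candidate is close enough."""
    matches = {}
    for sid, (slat, slon) in scraped.items():
        best, best_d = None, threshold_km
        for fid, (flat, flon) in frame.items():
            d = haversine_km(slat, slon, flat, flon)
            if d <= best_d:
                best, best_d = fid, d
        matches[sid] = best
    return matches

scraped = {"portal_1": (52.2297, 21.0122)}
frame = {"frame_a": (52.2299, 21.0124), "frame_b": (52.4000, 21.3000)}
print(match_closest(scraped, frame, threshold_km=0.5))  # {'portal_1': 'frame_a'}
```

The pairwise loop is quadratic in the number of establishments; for large frames a spatial index would normally be used, but the matching rule itself is the one described above.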
String matching
Techniques based on comparing text strings composed of the addresses, place names and postal codes of accommodation establishments can also be used to pair them. Among the many metrics for comparing text strings one can mention:
• Levenshtein distance (Levenshtein V. I. (1966)),
• Jaro-Winkler distance (Winkler W. E. (1990)),
• Jaccard index (Jaccard P. (1901)) for n-grams (Weaver W. (1955)),
• cosine distance for n-grams (Sidorov G., Gelbukh A., Gómez-Adorno H., Pinto D. (2014)).
The Levenshtein distance takes into account additions, deletions, and replacements of characters: the more such operations are needed to turn one string into the other, the greater the distance. The maximum distance between two strings is equal to the length of the longer string. The Jaro-Winkler distance takes into account letter transpositions and the occurrence of a common prefix. The Jaro-Winkler distance is normalized, but it does not satisfy the triangle inequality, so it is not a metric in the strict sense. The next two methods are based on the concept of n-grams. An n-gram of a given word is a sequence of n consecutive characters in that word. For example, the word “major” has three 3-grams: “maj”, “ajo”, “jor” and four 2-grams: “ma”, “aj”, “jo”, “or”. The Jaccard index is the number of common n-grams divided by the number of all distinct n-grams in both words. If a dictionary of n-grams is created on the basis of, e.g., all compared strings, then each string can be assigned a vector of binary variables describing the presence of the n-grams in it. Then, for every two vectors, the cosine of the angle between them can be determined, calculated as the quotient of their dot product and the product of their norms.
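The two n-gram based measures can be sketched with binary n-gram sets: the dot product of two binary indicator vectors is simply the size of the intersection of the n-gram sets. This is a minimal illustration; production record linkage would normally rely on a dedicated library:

```python
import math

def ngrams(s: str, n: int = 2) -> set:
    """Set of character n-grams of a string."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a: str, b: str, n: int = 2) -> float:
    """Common n-grams divided by all distinct n-grams in both strings."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

def cosine_similarity(a: str, b: str, n: int = 2) -> float:
    """Cosine of the angle between the binary n-gram indicator vectors."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 1.0 if ga == gb else 0.0
    return len(ga & gb) / math.sqrt(len(ga) * len(gb))

print(sorted(ngrams("major", 3)))           # ['ajo', 'jor', 'maj']
print(jaccard("hotel", "motel"))            # 3 common / 5 distinct 2-grams = 0.6
print(cosine_similarity("hotel", "motel"))  # 3 / sqrt(4 * 4) = 0.75
```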
In a deterministic approach, establishments are paired if the chosen similarity measure does not fall below a predetermined critical value. Again, setting the threshold plays a crucial role. Similarly to the distance-based approach described above, a critical value for string similarity that is chosen too high will result in more falsely missing pairs, while a critical value chosen too low will produce more False Positive pairs.
5 The Haversine formula is based on spherical trigonometry, whereas Vincenty's formulae are based on ellipsoidal trigonometry and require iteratively solving a system of (truncated) equations.
Evaluation of data linkage
The quality of the matching can be checked by reviewing the names and addresses of the paired establishments and deriving the numbers of correctly and incorrectly matched and unmatched establishments.
After the matching of accommodation establishments, four situations may occur:
1. An establishment was paired correctly (True Positive, TP).
2. Two establishments were incorrectly paired because the distance between them was small (smaller than the threshold) (False Positive, FP).
3. An establishment was correctly left unpaired because of a large distance (True Negative, TN).
4. Two establishments were incorrectly left unpaired because of the large distance between them (False Negative, FN).
The results of the matching can be summarized in the form of a confusion matrix, which is often used when evaluating the performance of a classification model (see Table 17). For more details see, e.g., Christen et al. (2007).
Table 17. Confusion matrix for data linkage results

| | Actual: match (establishment is present in portals and survey frame) | Actual: non-match (establishment is present in portals only) |
|---|---|---|
| Predicted: match | true matches: True Positives (TP) | false matches: False Positives (FP) |
| Predicted: non-match | false non-matches: False Negatives (FN) | true non-matches: True Negatives (TN) |

The total population consists of all establishments in the portals and survey frames.
Increasing the threshold (i.e. the allowed distance between a pair of coordinates) generates a higher number of True Positives and False Positives and, at the same time, a lower number of True Negatives and False Negatives; it is a trade-off. To find an optimal threshold, several indicators can be used. Details can be found in Annex 1 – Quality indicators of data matching.
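Annex 1 lists the indicators used by WPJ; without reproducing it, the standard indicators derived from the confusion matrix can be sketched as follows (the counts are made up for illustration):

```python
def linkage_quality(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard quality indicators computed from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted matches that are correct
    recall = tp / (tp + fn)      # share of true matches that were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

print(linkage_quality(tp=90, fp=10, fn=30, tn=70))
```

Evaluating these indicators for a grid of candidate thresholds makes the trade-off explicit and supports the choice of an optimal threshold.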
Improvement of the quality of the frame for the survey of tourist accommodation establishments
The database of establishments created by data linkage of the statistical frame and the scraped data (hereafter referred to as the joint database) is the final output of the data linkage methodology. This database contains three subsets:
1. Establishments occurring in both the statistical frame and the scraped data: fully described by both data sources.
2. Establishments occurring in the statistical frame only: described by statistical data in a sufficient way.
3. Establishments occurring in the scraped data only: relevant variables need to be derived.
There are several variables possibly not covered by the scraped data or obtained by web scraping with uncertainty and possible misclassification, e.g.:
• Type of accommodation with respect to the statistical classification.
Information about the accommodation type is present in the portals, but this does not necessarily match the accommodation type in the statistical frame. The set of possible accommodation types in the portals usually only partially overlaps with the accommodation types used in official statistics. Moreover, an accommodation establishment may, e.g., be labelled a hotel in the portals even if it is not a hotel according to statistical definitions.
• Number of bed places.
It is usually not possible to derive the total number of bed places directly from the portals by a simple query. The reason is threefold:
o only some of the establishment's rooms are offered on the portals,
o some of the rooms offered on the portals are already booked (and are not displayed when querying),
o there is a limit on the number of adults in a query.
• Single months in which the establishment is (not) operating.
A preliminary analysis of the scraped data revealed that, on average, hotels have higher prices, star ratings, guest review ratings and total numbers of reviews than other types of accommodation establishments. Hotels also operate during the whole year more often than other establishments. Thus, all of these variables may help predict the accommodation type for the establishments present only in the scraped data.
The subset of establishments in the joint database that is common to the scraped data and the statistical data contains information about the true accommodation type used in official statistics and the months in which the establishment operates, as well as the variables obtained from the portals (listed above). This subset will be called the training set. One of the well-known classification methods is the classification tree; details of the method are provided in, e.g., Breiman et al. (1984). Thanks to the classification model, the type of accommodation used in official statistics can be assigned to all establishments that are only available in the web scraped data. A similar procedure can be applied for assigning the months in which an accommodation establishment is operating. The number of bed places can be estimated with a regression tree; implementations of classification and regression trees are almost the same. Several implementations of machine learning (ML) methods in R can be found in Brownlee (2017).
Model selection in any ML method must be carried out carefully. To this end, several approaches can be used, e.g. cross-validation or bootstrapping; details can be found in, e.g., Kohavi R. (1995). K-fold cross-validation and bootstrapping are implemented in R packages such as caret.
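The principle of a classification tree, recursively splitting the training set on the variable and threshold that best separate the classes, can be illustrated with a single-split ("decision stump") sketch. The feature names and values are hypothetical; the WPJ work itself relied on R implementations:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Find the (feature index, threshold) minimizing weighted Gini impurity."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[j] <= t]
            right = [l for r, l in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best

def stump_predict(rows, labels, x):
    """Predict the majority class of the training side that x falls on."""
    _, j, t = best_split(rows, labels)
    side = [l for r, l in zip(rows, labels) if (r[j] <= t) == (x[j] <= t)]
    return Counter(side).most_common(1)[0][0]

# Hypothetical training set: (price, number of reviews), with the true type
# taken from the statistical frame for the linked establishments.
rows = [(300.0, 200), (250.0, 150), (60.0, 10), (80.0, 20)]
labels = ["hotel", "hotel", "apartment", "apartment"]
print(stump_predict(rows, labels, (270.0, 180)))  # 'hotel'
```

A full classification tree applies the same split search recursively to each resulting subset until a stopping rule is met.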
Finally, after predictions for the missing information are made, all required variables are available for all establishments in the joint database. Since the geolocation data, i.e. the Nomenclature of Territorial Units for Statistics (NUTS) code and the longitude and latitude coordinates, are available at the micro (unit) level, the data can be aggregated at any spatial level.
2.1.2. Spatial-temporal disaggregation of tourism data
In the majority of European Union countries, the survey of the occupancy of tourist accommodation establishments is a monthly survey. The structure of the form does not allow answering the following questions:
• How many tourists stayed overnight during the last weekend of the month?
• How many tourists stayed at the accommodation establishment on each day of the month?
Estimating the daily number of tourists or overnight stays, given the known monthly values of these aggregates, is an example of the problem of temporal disaggregation.
Data sources
Two data sources are used for temporal disaggregation: low-frequency data (e.g. monthly data) that are subject to disaggregation, and high-frequency data (e.g. daily data) containing auxiliary variables. In the case of disaggregation of data on the occupancy of accommodation establishments, they include:
• Data from internet portals offering accommodation.
The data are collected using web scraping with a daily frequency and include information such as the location of the accommodation establishment (address and/or geographical coordinates), type of establishment, number of stars, guest rating and accommodation price. The collected data are used to calculate daily statistics of continuous variables, such as the price, broken down by type of establishment and its location. Missing data should be imputed. The data obtained by means of web scraping include both information on the establishments that are included in the survey of tourist accommodation establishments and on those that do not participate in this survey (new establishments not yet covered by the survey).
• Data from the survey of tourist accommodation establishments.
The survey data include information on accommodation establishments, such as the number of tourists accommodated, the number of nights spent and, for some types of establishments, also the number of rooms rented. Depending on the country in which the survey is conducted, it may cover all establishments or only a part of them, e.g. establishments with 10 or more bed places.
Estimation procedure
The first step in temporal disaggregation is regression. Using auxiliary variables, a regression model is built and preliminary estimates of the variable are prepared. The simplest approach selects one variable (Dagum E.B., Cholette P.A. (2006)) and estimates the model using the classical method of least squares, or selects multiple variables and estimates the model using the generalized least squares method (Chow G.C., Lin A.L. (1971), Fernandez R.B. (1981), Litterman R.B. (1983)).
Typically, the daily estimates do not add up to the monthly value. Hence, the second stage of temporal disaggregation is benchmarking: the difference between the monthly value and the aggregate of the daily estimates must be distributed over the days. The simplest versions of additive or multiplicative benchmarking are described in the works of Denton F.T. (1971) and Di Fonzo T. (2003). Other versions of benchmarking include modelling the random component of the daily data; the random component can be modelled as a first-order autoregressive process (Chow G.C., Lin A.L. (1971), Fernandez R.B. (1981)) or a random walk process (Litterman R.B. (1983)). Details can be found in Annex 2 – General approach for data disaggregation.
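In its simplest multiplicative form, benchmarking rescales the preliminary daily estimates from the regression step so that they sum exactly to the known monthly value. A minimal sketch with made-up numbers (the Denton and Chow-Lin variants distribute the discrepancy more smoothly):

```python
def prorata_benchmark(daily_estimates, monthly_total):
    """Rescale preliminary daily estimates so they add up to the monthly total."""
    total = sum(daily_estimates)
    return [monthly_total * d / total for d in daily_estimates]

daily = [10.0, 20.0, 30.0]              # preliminary regression estimates
print(prorata_benchmark(daily, 120.0))  # [20.0, 40.0, 60.0]
```

The rescaled series preserves the day-to-day profile implied by the regression while satisfying the monthly benchmark exactly.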
It may occur that the results of the temporal disaggregation are unacceptable, e.g. negative. This is most often the result of poor regression results. In this case, one can take the logarithm of the dependent variable; after modelling the high-frequency data on the logarithmic scale, the back-transformed regression results are always positive. Another possibility is to use the idea of “shrinkage”. This method reduces the mean squared error at the expense of biasing the results. Examples of this method are:
• the least absolute shrinkage and selection operator, abbreviated as LASSO (Tibshirani R. (1996)),
• ridge regression (Hoerl A. E., Kennard R. W. (1970)).
In general, both methods can be described as least squares with a condition that constrains the norm of the parameter vector, but the way they work is slightly different.
The LASSO method adds variables to the model sequentially, depending on the value of the regularization parameter: the higher its value, the closer the results are to those obtained with the classical method of least squares. For ridge regression there is also a non-negative regularization parameter, but all the variables are included in the model at once. If the parameter is zero, ridge regression reduces to least squares; the higher the parameter value, the closer the results are to the mean value of the forecasted variable.
Both methods generate several sets of forecasts. The researcher selects the result set that is closest to the least squares result and is, at the same time, acceptable.
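The shrinkage effect is easiest to see in the one-predictor, no-intercept case of ridge regression, where the estimator has the closed form beta(lambda) = sum(x*y) / (sum(x^2) + lambda). This is an illustrative sketch, not the WPJ disaggregation model:

```python
def ridge_beta(x, y, lam):
    """Closed-form ridge coefficient for one predictor without intercept."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(ridge_beta(x, y, 0.0))   # 2.0, identical to ordinary least squares
print(ridge_beta(x, y, 14.0))  # 1.0, shrunk towards zero
```

As the regularization parameter grows, the coefficient shrinks and the fitted values move towards a constant, which is the biasing effect traded for a lower mean squared error.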
Precision measures and tests for temporal disaggregation
In the case of regression, there are many measures of precision such as R-squared, Mean Absolute
Percentage Error (MAPE), Root Mean Squared Error (RMSE), etc. These measures are calculated from
the true and forecasted values. The difficulty in measuring precision for temporal disaggregation is that
there are no actual values for daily data. It cannot be assumed that the precision measures for monthly
data will be analogous to the measures for daily data, because the benchmarking stage of temporal
disaggregation distorts the forecasts obtained at the regression stage. To account for this, precision
measures should be calculated both for the daily-data errors and for the monthly-data errors.
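A minimal sketch of the two measures, checked at the monthly level by aggregating the daily forecasts back to months (all series and figures are illustrative):

```python
def rmse(actual, pred):
    """Root Mean Squared Error."""
    return (sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual)) ** 0.5

def mape(actual, pred):
    """Mean Absolute Percentage Error (in %)."""
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, pred)) / len(actual)

# Daily forecasts can only be checked indirectly: aggregate them back to
# months and compare with the observed monthly benchmarks.
daily_pred = [10.2, 9.8, 11.1, 10.4, 9.9, 10.6]   # illustrative daily series
month_len = 3                                      # toy "months" of 3 days
monthly_pred = [sum(daily_pred[i:i + month_len])
                for i in range(0, len(daily_pred), month_len)]
monthly_obs = [31.0, 31.2]                         # observed monthly totals

print(rmse(monthly_obs, monthly_pred), mape(monthly_obs, monthly_pred))
```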
The result can be tested in terms of weekly and annual seasonality. The tests include:
• F-test for binary variables (Maravall, 2011),
• ANOVA Welch test (Welch, 1951),
• Kruskal–Wallis test (Kruskal and Wallis, 1952).
If the null hypothesis is rejected, the seasonality is significant.
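The Kruskal–Wallis test, for instance, can be sketched in pure Python. This is a minimal version without tie correction; the grouping and the values are illustrative:

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for k groups.

    Under H0 (identical distributions) H is approximately chi-squared
    with k-1 degrees of freedom; a large H indicates significant
    (e.g. weekly) seasonality in the grouped daily values.
    """
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(pooled, start=1):
        rank_sums[gi] += rank
    h = 12.0 / (n * (n + 1)) * sum(
        rs * rs / len(g) for rs, g in zip(rank_sums, groups)) - 3 * (n + 1)
    return h

# Illustrative: daily values grouped by day of week (here only 3 "days").
weekday_groups = [[5.1, 4.9, 5.3], [7.2, 7.0, 6.8], [5.0, 5.2, 4.8]]
h = kruskal_wallis_h(weekday_groups)
assert h > 0  # compare against the chi-squared(k-1) critical value
```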
2.2. Flash estimates
According to the Regulation (EU) No 692/2011 of the European Parliament and of the Council of
6 July 2011 concerning European statistics on tourism as well as the Commission Delegated Regulation
(EU) 2019/1681 of 1 August 2019 amending Regulation (EU) No 692/2011 of the European Parliament
and of the Council concerning European statistics on tourism, European Union countries should submit
to Eurostat data on accommodation statistics in the following scope:
1. For “hotels and similar accommodation” and for “holiday and other short-stay
accommodation”, the scope of observation should at least include all tourist accommodation
establishments having 10 or more bed places.
2. For “camping grounds, recreational vehicle parks and trailer parks”, the scope of observation
should at least include all tourist accommodation establishments having 10 or more places.
3. Member States accounting for less than 1% of the total annual number of nights spent at
tourist accommodation establishments in the European Union may further reduce the scope
of observation, to at least include all tourist accommodation establishments having 20 or more
bed places (20 or more places).
The scope of information about tourist accommodation establishments provided to Eurostat can
therefore differ between countries, and in fact it does. In many countries, establishments with fewer
than ten bed places are not covered by the surveys; additionally, different thresholds are used in
specific countries to collect information on establishments. The frequency of data collection is not very
diverse and data are usually collected on a monthly basis (all the project partner countries collect
information monthly), while differences often occur in the observed population of establishments as
well as in the methods of data collection.
This is an important issue when developing flash monthly estimates concerning the capacity of tourist
accommodation establishments, as specific countries have a different scope of survey-based
information at their disposal. Table 18 shows the scope and frequency of surveys on tourist
accommodation establishments in partner countries.
Table 18. Data collection on accommodation establishments in partner countries

Country | Monthly survey | Yearly survey
BG | Establishments with 10 or more bed places |
DE | Establishments with 10 or more bed places |
EL | Without threshold |
IT | Without threshold | Only rural accommodation establishments
NL | A sample of establishments is drawn monthly from the yearly survey, which is based on opening periods and bounded by capacity restrictions: 55.1 – hotels, hostels and B&Bs with more than 5 bed places; 55.2 – rental houses/apartments/youth hostels with more than 10 bed places; 55.3 – camping sites with more than 4 pitches | Without threshold
PL | Establishments with 10 or more bed places | Establishments with 9 or fewer bed places
PT | Hotel establishments (hotels, apartment hotels, tourist apartments, tourist villages, pousadas and quintas da Madeira) – without threshold; local accommodation with 10 or more beds; rural/lodging tourism – without threshold |
Table 19. Data collection on accommodation establishments in partner countries (cont.)

Country | Monthly survey | Yearly survey
SK | Establishments without threshold regarding the number of bed places | Statistical units (legal persons and natural persons/entrepreneurs with the Company Registration Number) – accommodation establishments
Description of data sources
Two data sources were used to estimate the variables of interest in terms of the occupancy of tourist
accommodation establishments:
• data from web scraping of tourist accommodation portals;
• data from the survey on the use of the accommodation base.
Data from the web scraping of accommodation booking portals contain information about the
accommodation establishment, i.e. its location, and data on the offer of this entity, the most important
of which are the price and type of the accommodation establishment. These data are then used to
calculate monthly aggregates, which provide information about the distribution of prices in the main
types of accommodation establishments as well as the number of offers in these establishments.
Data from a statistical survey come from a monthly survey on the occupancy of tourist accommodation
establishments, which covers entities of the national economy conducting activities classified
according to NACE in groups 55.1, 55.2 and 55.3. The aggregated data from this survey include
information on the number of establishments and their types in a given region (voivodship), the
number of guests, the number of nights spent and the number of rented rooms.
Methodology
Data aggregated into monthly series from both sources described above are combined. It should be
noted that data from web scraping can be obtained almost on an ongoing basis, while the data from
the survey are available with a delay of several months. Missing values from the survey of
accommodation establishments will be forecasted on the basis of available data from web scraping.
After combining the data, a set of explanatory variables is determined for each explained variable.
The developed algorithm determines these variables in two stages.
In the first stage, a correlation matrix is used, from which only the variables best correlated with the
dependent variable are selected as potential candidates for the model.
The second stage uses the method of Hellwig’s information capacity indicators (Hellwig, 1968;
Hellwig, 1972), the idea of which is to select explanatory variables that are strongly correlated with
the dependent variable and, at the same time, weakly correlated with each other. In order to use
this method, the vector of correlations of the dependent variable with the explanatory variables
and the correlation matrix of the explanatory variables should be calculated. The method considers all
combinations of potential explanatory variables. For each combination of potential explanatory
variables, information capacity indicators are calculated: individual and integral. The individual and
integral indices of information capacity are normalized in the interval [0, 1]; their value is the higher,
the more strongly the explanatory variables are correlated with the dependent variable and the more
weakly they are correlated with each other.
The combination of the variables corresponding to the maximum value of the integral information
capacity of the indicators is selected as the set of explanatory variables.
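Hellwig's selection step can be sketched as follows. This is a minimal illustration; the correlation values are made up, not project data:

```python
from itertools import combinations

def hellwig_best_subset(r0, r):
    """Hellwig's information capacity method (a minimal sketch).

    r0[j]   - correlation of candidate variable j with the dependent variable
    r[i][j] - correlation between candidate variables i and j
    Returns (best combination of variable indices, its integral capacity).
    """
    m = len(r0)
    best, best_h = None, -1.0
    for k in range(1, m + 1):
        for combo in combinations(range(m), k):
            # integral capacity = sum of individual capacities h_j, where
            # h_j = r0_j^2 / (1 + sum of |r_ij| over the other variables i)
            h = sum(
                r0[j] ** 2 / (1 + sum(abs(r[i][j]) for i in combo if i != j))
                for j in combo)
            if h > best_h:
                best, best_h = combo, h
    return best, best_h

# Illustrative correlations: variables 0 and 1 are strongly intercorrelated.
r0 = [0.8, 0.7, 0.75]
r = [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.3],
     [0.2, 0.3, 1.0]]
combo, capacity = hellwig_best_subset(r0, r)
```

In this toy example the method picks variables 0 and 2, skipping variable 1 because it is nearly redundant with variable 0.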
Then a regression model is built based on the selected variables to estimate the missing values. On the
basis of the preliminary results, it can be concluded that even the linear regression model estimated
by the least squares method has good prognostic properties. With sufficiently long time series it is
possible to make forecasts based on the ARIMAX model (Autoregressive Integrated Moving Average
with Explanatory Variable).
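A minimal sketch of the regression step: filling delayed survey values from promptly available scraped aggregates with a one-predictor least squares fit (all series and names are illustrative):

```python
def fit_simple_ols(x, y):
    """Fit y = a + b*x by least squares (a minimal sketch; in practice
    predictors are first selected with Hellwig's method)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

# Illustrative monthly series: scraped median price (available promptly)
# and survey occupancy (available with a delay, last months missing).
scraped = [100.0, 110.0, 105.0, 120.0, 115.0, 125.0]
survey  = [50.0, 55.0, 52.0, 60.0, None, None]   # delayed survey values

known = [(s, v) for s, v in zip(scraped, survey) if v is not None]
a, b = fit_simple_ols([s for s, _ in known], [v for _, v in known])
filled = [v if v is not None else a + b * s for s, v in zip(scraped, survey)]
```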
2.3. Methodology to improve the quality of data in various statistical areas
In almost every EU country, National Statistical Institutes conduct a household sample survey on the
participation of residents in trips. Due to the level of detail of the questionnaire and the small
subsample sizes for some destination countries, the results have large variability in expenditure, as
well as a problem of non-response related to expenditure by expenditure category for some
destination countries. If a country domain is not well represented, there may be no representatives in
the sample for some strata, and then results such as expenditures for business trips cannot be
estimated.
Countries visited by modes of transport other than air are usually numerous in the sample survey;
most travel destinations that are rare in the sample survey are visited by air. Therefore, the emphasis
was put on this mode of transport.
The methodology for estimating the number of trips in air traffic was developed in the Big Data I
project. Within the framework of Big Data II project this methodology has been enhanced and linked
to the methodology of expenditure estimation.
In the presented approach there are several data sources to combine to estimate trips and
expenditures:
• destination airports available from the national airports,
• a list of airports all over the world covering airport name, city, country, International Air
Transport Association (IATA), International Civil Aviation Organization (ICAO) and Federal
Aviation Administration (FAA) codes,
• data on flight routes consisting of origin, hub and destination airports, IATA and ICAO codes,
type of aircraft, date of arrival and departure,
• technical data on aircraft covering the type of aircraft, airline, ID, total number of seats and
seats in each class,
• administrative data from Civil Aviation Offices (CAO) on the number of passengers from
national airports in domestic and foreign traffic; foreign traffic data covered only direct flights
from national airports, which gives far fewer countries than the sample survey,
• crucial statistics of trips from the sample survey (share of residents in total international air
traffic),
• prices and costs of flight tickets, accommodation, local transportation, restaurants and cafes,
etc., obtained by web scraping of relevant portals,
• data on expenditure from the sample survey.
The next figure shows which data sources can be combined.
2.3.1. Estimations of the size of tourist traffic
In this section, a method for estimating the size of tourist traffic is presented. This approach is based
on a simple linkage of several data sources using unique identifiers and the calculation of conditional
distributions. Data on flights usually cover the name and location of an airport, IATA and ICAO codes,
type of aircraft, date of arrival and departure, etc. Country name, type of aircraft, and IATA and ICAO
codes are the keys used to join the data sources into one database.
The distribution of trips with respect to air traffic can be estimated with the following procedure:
• Collect all origin and destination airports with the use of an online flight connection search
engine.
• Use flight schedules to derive a distribution of flights for each origin airport.
• Attach the country where each airport is located using airport code lists (IATA, ICAO, FAA).
• Attach the capacity of each aircraft (seats) using the technical data on aircraft.
• Calculate a distribution of flights (measured in seats) for each airport available from national
airports. The destination airports identified in this step will be called the hub airports.
• Calculate a distribution of flights for each hub airport. Repeat this step until all destination
airports are reached.
• Remove all routes that are irrelevant with respect to time or cost efficiency. For instance, the
route Stockholm – Lisbon – Rio de Janeiro seems fine, but Lisbon – Kiev – Rio de Janeiro does
not. Time and cost efficiency do not provide an unambiguous rule; nevertheless, the more
obviously irrelevant routes are removed, the better the results.
• Use data from the Civil Aviation Office to benchmark the distribution of flights from a given
origin airport to the known total.
• For each hub airport, calculate the number of passengers who travel further, using a relevant
statistic from the sample survey (the share of tourists using airports of a given country as a hub).
• Sum up passengers from all routes: travelling directly from the origin country, travelling with
the use of one hub airport, etc.
The analysis of the size of tourist traffic showed that the distribution of trips in air traffic based only
on big data is different from the distribution of trips in the sample survey, as it represents the
preferences both of the inhabitants of a given country and of other Europeans who use airports in this
country as transfer airports. This means that the trip estimator is biased. On the other hand, estimates
of trips based on big data show less variability over time and greater completeness than the results
from the sample survey. Therefore, adopting results based only on big data is not a satisfactory choice.
The question arises of how to combine two sets of trip data where one is unbiased but its precision is
not acceptable at very low aggregation levels, while the other is biased but has far greater stability of
results at low aggregation levels.
In general, the problem of combining biased and unbiased estimators has been studied by many
researchers. One of the proposed solutions is the James-Stein estimator (Green and Strawderman, 1991;
Fourdrinier and Wells, 2012). The James-Stein estimator can be used to obtain the final distribution of
trips. Details of its use can be found in Annex 3 – Combining biased and unbiased data.
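A hedged sketch of the positive-part James–Stein rule for this setting, shrinking the unbiased but noisy survey estimates towards the stable but biased big data estimates (the variance and all figures are illustrative; the project's exact formulation is in Annex 3):

```python
def james_stein_combine(unbiased, stable, sigma2):
    """Shrink unbiased (survey) estimates towards stable but biased
    (big data) estimates, a la James-Stein.  sigma2 is the sampling
    variance of the survey estimates (assumed equal across cells).
    """
    k = len(unbiased)
    dist2 = sum((u - s) ** 2 for u, s in zip(unbiased, stable))
    # Positive-part shrinkage factor: 0 = full shrinkage to big data,
    # 1 = keep the survey estimates unchanged.
    shrink = max(0.0, 1.0 - (k - 2) * sigma2 / dist2)
    return [s + shrink * (u - s) for u, s in zip(unbiased, stable)]

survey  = [120.0, 80.0, 40.0, 15.0, 5.0]   # unbiased, noisy trip counts
bigdata = [110.0, 85.0, 35.0, 20.0, 8.0]   # biased, stable trip counts
combined = james_stein_combine(survey, bigdata, sigma2=25.0)
```

Each combined estimate lies between the two inputs: the noisier the survey (larger sigma2), the more weight moves to the stable big data figures.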
2.3.2. Estimation of expenses related to trips
In the presented approach, there are two groups of data sources used to build the expenditure
database: the sample survey of trips and data from various portals obtained through web scraping.
The basic data sources that can be linked to air trip expenditure include websites offering the booking
and sale of accommodation, airline tickets, and websites with catering establishments. Through web
scraping, one can collect data on, among other things:
• prices of accommodation by type of accommodation establishment,
• ticket prices by airline and route,
• average prices of meals in catering establishments (on websites, in the restaurant description,
price ranges for standard dishes are often given and not prices for specific meals).
These data can be divided into two groups: data that should be collected frequently due to rapid
changes, e.g. flight ticket prices and accommodation prices, and data that can be collected with low
frequency, e.g. prices of food and beverages and local transportation costs.
The data from the sample survey on trips should cover several consecutive years to include a wide
range of trips with respect to their descriptions (e.g. destination country, purpose of trip, type of
accommodation, means of transportation) and their expenditure.
The use of big data to estimate trips with the James-Stein estimator increases the number of countries
with estimated trips in relation to the results of the sample survey alone. Since some distant destination
countries may not be well represented in the expenditure database, auxiliary information, e.g. scraped
data, is needed to improve the estimation of expenditure. These data cannot be linked to the data
from the sample survey directly. For instance, there is a trip to Brazil, means of transport: plane,
accommodation establishment type: hotel. In the scraped data there are many offers of hotels in Brazil
in a given period of time as well as many offers of flight tickets. Thus, the scraped data need to be
aggregated: only some statistics should be derived from the data.
The data analysis showed that the distribution of accommodation prices has strong right asymmetry
and outliers. After comparing various statistics of accommodation prices from portals with the average
expenses per night reported in the survey on trips, it turned out that the median price better describes
the variability of accommodation expenditure in the survey.
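A small illustration of why the median is preferred for right-skewed price data (the figures are made up):

```python
# Right-skewed accommodation prices with an outlier (illustrative values).
prices = [40, 45, 50, 55, 60, 70, 400]

mean_price = sum(prices) / len(prices)
median_price = sorted(prices)[len(prices) // 2]

# The outlier drags the mean far above the typical offer; the median stays
# close to what a tourist actually pays, matching survey averages better.
assert median_price < mean_price
```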
First, the case of estimating expenditure for countries that did not appear in the sample survey in
a given period, but appeared in historical data, will be presented.
In the first step, a database must be prepared. It is a subset of the historical trip survey database
containing, for each trip, descriptive variables such as destination country, purpose of trip and type of
accommodation, as well as the expenditure by category. There must be no missing values in the
expenditure variables.
After aggregating the data from big data sources to the form of statistics, they can be added to the
database of individual trip survey data according to the keys, e.g. country, type of accommodation
establishment and quarter. By using the combined big data and the categorical variables describing
trips, it is possible to estimate travel expenses and average expenditure for accommodation. If the
number of trips is already given, then only the average expenditure needs to be estimated. The total
expenditures will be obtained by multiplying the number of trips and the average expenditure
(by expenditure category).
When the database is ready, there is the need to choose a method for prediction. Among others, the
following methods may be used:
• Regression Trees (Breiman et al., 1984),
• Support Vector Machine (Cristianini et al., 2000),
• Random Forest (Ho, 1998),
• Optimal Weighted Nearest Neighbour Classifiers (Samworth, 2012),
• Predictive Mean Matching (Rubin, 1986),
• Bayesian Linear Regression (Box et al., 1973),
• Random Indicator for Nonignorable Data (Jolani, 2012).
These methods are available in the rpart, kknn, e1071, randomForest, mice, and caret packages in R.
Comparison and selection of the prediction model may be based on cross-validation or bootstrapping
with the use of e.g. the Root Mean Squared Error (RMSE) or the Mean Absolute Percentage Error
(MAPE). The best results in terms of RMSE were achieved by the Optimal Weighted Nearest Neighbour
Classifier and the Regression Tree, while in terms of MAPE the best results were achieved by the
Support Vector Machine. Moreover, the results were always admissible for these three methods, while
in the case of some other methods the results were sometimes inadmissible, e.g. negative.
It is possible to predict the expenditures for a new dataset with the use of the selected model.
To create the new dataset for a given country:
• Pick trips to this country from the past data from the same quarter (or from all past data if the
quarter is not available).
• Find in the whole dataset the k trips most similar to those selected in the previous step, with
the use of the kknn package. Similarity is based on trip characteristics, expenditures, and big data.
Now, average expenditures can be estimated with a use of the selected model for all records in the
dataset for a given country. Furthermore, total expenditures can be estimated as well in some
breakdowns as this country is represented by several trip records.
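The nearest-neighbour step can be sketched as follows. This is a minimal Euclidean k-NN; the project itself uses the kknn package in R, and all feature values here are illustrative:

```python
def knn_average_expenditure(target, trips, k=3):
    """Average expenditure of the k trips most similar to `target`.

    Each trip is (feature_vector, expenditure); similarity is plain
    Euclidean distance on already-scaled features.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(trips, key=lambda t: dist(t[0], target))[:k]
    return sum(exp for _, exp in nearest) / k

# Features: (median scraped hotel price, median ticket price), scaled 0-1.
trips = [((0.2, 0.3), 800.0), ((0.25, 0.35), 900.0),
         ((0.8, 0.9), 2500.0), ((0.22, 0.28), 850.0)]
avg = knn_average_expenditure((0.21, 0.31), trips, k=3)
```

In the toy data the expensive outlier trip is excluded from the three nearest neighbours, so the estimate reflects only comparable trips.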
The meta model can be used to estimate expenses for a country that has never appeared in the sample
survey but nevertheless exists in big data sources. For this purpose, the same database can be used,
but the model is estimated without the categorical variables describing the trip. Then, the average
expenditure per tourist is estimated based only on big data. By using the kknn package, it is possible
to find the countries that are most similar to a given country. Then, the previously described procedure
can be used to obtain data in different breakdowns.
2.3.3. Tourism Satellite Accounts
Tourism is a specific type of economic activity that is defined not in terms of production, like other
economic activities, but in terms of consumption. In the case of tourism, the factor determining the
scope of the phenomenon is the way products are used, not the way they are produced. Therefore,
tourism is distinguished neither in terms of institutional sectors nor in terms of types of activity,
products or services, which makes it impossible to analyse it within the standard system of national
accounts.
For this reason, the methodology of national accounts provides the possibility of introducing tools
enabling the analysis of phenomena such as tourism, in functional terms. These tools are known as
satellite accounts because they are, in some measure, satellites of the main accounts. These accounts
provide a comprehensive view and the possibility of economic analysis of the tourism sector by
balancing tourism-related supply and demand and estimating the direct impact of tourism on the main
macro-aggregates of a country’s economy.
The methodological framework of the Tourism Satellite Account: Recommended Methodological
Framework 2008 (TSA: RMF 2008) was developed by the World Tourism Organization (UNWTO), the
United Nations Statistics Division (UNSD), the Organization for Economic Cooperation and
Development (OECD) and the Statistical Office of the European Union (Eurostat). This account is
harmonized with national accounts, using the same concepts, definitions and classifications as national
accounts, and is used to measure tourism activity and the importance of tourism to the economy of
a country or region.
The complete Tourism Satellite Account provides6:
• Macroeconomic aggregates that describe the size and the direct economic contribution of
tourism, such as tourism direct gross value added (TDGVA) and tourism direct gross domestic
product (TDGDP), consistent with similar aggregates for the total economy and for other
productive economic activities and functional areas of interest;
• Detailed data on tourism consumption, a more extended concept associated with the activity
of visitors as consumers, and a description of how this demand is met by domestic supply and
imports, integrated within tables derived from supply and use tables which can be compiled
both at current and constant prices;
• Detailed production accounts of the tourism industries, including data on employment,
linkages with other productive economic activities and gross fixed capital formation;
• A link between economic data and non-monetary information on tourism, such as the number
of trips (or visits), duration of stay, purpose of trip, modes of transport, etc., which is required
to specify the characteristics of the economic variables.
The full TSA consists of a set of 10 tables. Tables 1-3 contain data on the expenditure of visitors
(residents and non-residents), Table 4 contains information on tourism consumption from Tables 1-2
and its components that are difficult to assign to a specific type of tourism, which together make up
the internal tourism consumption. Table 5, in turn, concerns the supply side - it is an account of the
production of tourism goods and services. Table 6 is the basic element of the TSA. It compares tourism
supply with domestic tourism consumption. This table is the basis for calculating the direct
contribution of tourism to the economy, i.e. tourism direct gross value added (TDGVA) and the
contribution of tourism to GDP. Table 7 provides information on employment in tourism activities, and
Table 8 presents tourism-related fixed capital formation and collective consumption; these tables are
optional and are not an essential part of the TSA. Table 10 contains selected non-financial measures to facilitate
the interpretation of TSA results.
The result of the TSA compilation is both the possibility of a comprehensive analysis of the impact of
tourism on individual industries, and the possibility of a synthetic assessment of its contribution to the
economy.
6 Tourism Satellite Accounts in Europe 2019 edition report
The current tourism satellite account methodology is used in many countries on all continents.
Since 2010, Eurostat has encouraged EU Member States to submit available national TSA data every
three years, on a voluntary basis. The latest data were submitted to Eurostat in 2019 and, according to
the Tourism Satellite Accounts in Europe 2019 edition report, they were submitted by 27 European
countries (25 Member States and 2 EFTA countries); the increasing number of countries preparing TSAs
attests to their great importance in shaping tourism policy. All partner countries participating in the
project are developing TSAs.
According to the above mentioned report, national methodologies for developing TSA are not
harmonized enough to produce data fully comparable across countries. Comparability and
interpretation of results are influenced by methodological differences between national TSAs and TSA:
RMF 2008, different degrees of completeness of the tables, different levels of statistical “maturity”
(some data are preliminary, others come from pilot projects) and the different reference years
reported. However, the
results provide useful information on the state of implementation of TSA and estimates at the EU and
national level on the economic dimension of tourism.
An important element of the TSA methodology and the starting point is the determination of the value
of demand related to tourist consumption. As far as the demand side is concerned, it was assumed
that it consists of the consumption expenditure of tourist traffic participants, i.e. expenditure on goods
and services purchased by or on behalf of visitors in connection with a specific tourist trip. The TSA
does not measure all the economic effects of tourism, but only the effects of the tourism expenditure
incurred during the reporting period, usually during the year. The investment expenditure related to
the development of tourism supply is not analysed in the core part of the TSA and is not included in
the calculation of tourism’s contribution to value added and GDP in the economy.
The basic sources of supply, on the other hand, are mainly tables closely related to national accounts,
i.e. supply and use tables, input-output tables.
Among the sources of data on tourism demand, the most common listed in the report Tourism Satellite
Accounts in Europe 2019 edition are the accommodation surveys conducted among enterprises for the
purposes of Regulation 692/2011 on European statistics on tourism, and the surveys on tourist demand
conducted among households. Countries also use many other available sources and survey results, but
these do not fully meet the TSA's information needs.
One of the objectives of the WPJ was an inventory of previously used and potential sources of data on
tourism, as well as the use of innovative data collection methods and combining data from various
sources in order to improve the completeness and quality of data collected so far.
Currently, obtaining data on tourists and their expenditure is very difficult, on the one hand, due to,
among other things, mass tourism and open borders, as well as the fact that tourists are a specific
group (a fleeting population): the people to be surveyed are tourists only when they travel. Obtaining
reliable information through surveys is therefore often a complicated and costly undertaking.
On the other hand, the availability of new data sources such as big data, the use of administrative
registers and the use of innovative data collections methods (web scraping) provide potentially great
opportunities to improve the estimates of expenditure incurred by tourists during the trip or to
improve the quality of data on the accommodation establishments and their occupancy by tourists.
The scope of work carried out as part of the WP, of course, did not allow for the development of
a tourism satellite account, taking into account estimates based on newly acquired data, as the work
concerned the improvement of completeness and quality of data relating mainly to the TSA tables
concerning tourism expenditure. Moreover, the data necessary to develop TSA data is much more
detailed than the pilot data collected during the project implementation. In addition, the proposed
methods of collecting and estimating data relate to the years 2019-2020, and those developed by TSA
partner countries most often concern earlier years - the development of TSA is a time-consuming
process and due to the availability of necessary data from national accounts, it is usually delayed
compared to the reference years in tourism statistics.
Table 20. Reference year for the last TSA data

Year | Country
2018 | NL, PT – estimation
2017 | IT, SK, BG
2015 | DE, EL
2013 | PL

Source: TSA 2019 edition and website information
Nevertheless, based on the results of the work carried out by the WPJ, an attempt was made to assess
the degree of improvement in the data on tourist expenditure incurred by tourists in the reference
country included in Tables 1, 2, 4 and 6 of the TSA.
Attempts to estimate the impact of the newly acquired data on the values in the TSA tables concerning
tourists’ expenditure were made on the example of TSA for Poland. The main reason was that the
testing of the newly proposed method of improving the estimation of the number of trips and the
amount of outbound expenditure of residents was carried out on the results of the survey of the trips
of residents carried out in Poland. This survey is conducted not only for the needs of tourism statistics,
but also for the needs of balance of payments statistics, and its methodology allows for the
identification of detailed items of expenditure (incurred by tourists in the country and abroad).
Therefore, the newly estimated values were also compared to the results of the last version of TSA for
Poland. As indicated above, the last TSA for Poland in full version unfortunately refers to 2013.
Consequently, the estimates of the increase in the value of expenditure obtained thanks to the use of
new methods of obtaining and combining data should be considered only illustrative and intended to
show the scale of the increase in the value of total expenditure (consumption). When estimating the
increase in the value of expenditure, data on tourist accommodation establishments obtained as
a result of scraping portals offering accommodation were used, as well as the results of estimating the
number of outbound trips of Poles to non-European countries (which are insufficiently surveyed) and
the expenditure incurred in Poland during these outbound trips.
To estimate the increase in expenditure of foreigners in Poland and domestic expenditure of Poles,
data on the number of tourist accommodation establishments obtained as a result of web scraping,
taking into account the size of newly identified establishments, were used. Additionally, for tourism
expenditure of Poles, the results of estimating the number of foreign trips with overnight stays of Poles
and related expenditure were also used. Thus, the estimates concerned only tourists using
accommodation.
Table 21. Inbound tourism expenditure (TSA Table 1) – Poland

Expenditure (million EUR) | Total | By tourists (overnight visitors) | By same-day visitors
TSA 2013 | 8 446 | 4 863 | 3 583
Project data | 8 504 | 4 921 | 3 583*
Increase in % (TSA 2013 = 100) | 0.7 | 1.2 | –
*Not estimated
In the case of foreigners, the new methods of collecting and combining data would increase inbound
tourists’ expenditure by EUR 58 million (i.e. 1.2%) compared to data provided by traditional sources.
Taking into account that foreign tourist expenditure related to overnight visitors accounted for 58.0%
of total foreign tourist expenditure, the latter would increase by 0.7%.
Table 22. Domestic tourism expenditure (TSA Table 2) – Poland

Expenditure (million EUR) | Total | By tourists (overnight visitors) | By same-day visitors
TSA 2013 | 4 380 | 3 906 | 474
Project data | 4 536 | 4 062 | 474*
Increase in % (TSA 2013 = 100) | 3.6 | 4.0 | –
*Not estimated
The expenditure incurred in the territory of the country by Poles calculated using data obtained by the
solutions proposed in the project was estimated at EUR 4 536 million and was 3.6% higher than the
amount calculated earlier, while the increase in tourists' expenditure related to overnight visitors was
4.0% (EUR 156 million).
Table 23. Outbound tourism expenditure (TSA Table 3) – Poland

 | Total outbound expenditure (million EUR)
TSA 2013 | 3 145
Project data | 3 515
Increase in % (TSA 2013 = 100) | 17.8
The greatest changes in the estimated tourism expenditure resulting from the application of the new
methods were observed in the case of outbound expenditures of Poles. The estimated increase was
EUR 370 million, corresponding to 17.8%. The data shown in this table are still not used in the
development of the TSA, but they are a very important source of information, from the point of view
of a given economy, on the expenditure related to tourism trips.
Table 24. Internal tourism consumption (TSA Table 4)

Poland (million EUR)          Internal      Internal      Inbound       Domestic      Other components
                              tourism       tourism       tourism       tourism       of tourism
                              consumption   expenditure   expenditure   expenditure   consumption
TSA 2013                      14 505        12 826        8 446         4 380         1 679
Project data                  14 719        13 040        8 504         4 536         1 679*
Increase vs TSA 2013 (%)         1.5           1.7          0.7           3.6
* Not estimated
In 2013, total tourism consumption in the economic territory of Poland was estimated at
EUR 14 505 million. Almost 90% of this consumption was internal tourism expenditure, and the major
part of that (66%) was the expenditure of tourists and same-day visitors coming from abroad. Using
the new sources and methods of collecting data, while maintaining the relevant structures from 2013,
tourism consumption was estimated at EUR 14 719 million, i.e. 1.5% higher.
Table 25. Total domestic supply and tourism ratio (part of TSA Table 6)

Poland                        Tourism gross       Domestic supply    Internal tourism   Tourism
                              value added         (at purchasers'    consumption        ratio1
                              (at basic prices,   prices, million    (million EUR)      (%)
                              million EUR)        EUR)
TSA 2013                      3 514               954 517            14 505             1.52
Project data                  3 584                                  14 719             1.54
Change vs TSA 2013             +2.0%                                  +1.5%             +0.02 pp
1 Tourism ratio = internal tourism consumption (at purchasers' prices) as a proportion of domestic supply.
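The tourism ratio defined in the footnote above can be recomputed directly from the published figures; a minimal Python sketch (function and variable names are ours, not part of the TSA methodology):

```python
# Recomputing the tourism ratio from Table 25 (all figures in million EUR).
# The ratio is internal tourism consumption as a share of domestic supply.

def tourism_ratio(consumption_meur: float, supply_meur: float) -> float:
    """Internal tourism consumption as a percentage of domestic supply."""
    return 100.0 * consumption_meur / supply_meur

domestic_supply = 954_517                                # at purchasers' prices
ratio_tsa2013 = tourism_ratio(14_505, domestic_supply)   # TSA 2013 figures
ratio_project = tourism_ratio(14_719, domestic_supply)   # project figures

print(round(ratio_tsa2013, 2))                   # 1.52
print(round(ratio_project, 2))                   # 1.54
print(round(ratio_project - ratio_tsa2013, 2))   # 0.02 percentage points
```

This confirms that the new expenditure estimates move the ratio only from 1.52% to 1.54%.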
It can be seen that the increase in tourism consumption in Poland estimated on the basis of the new
expenditure values was not large enough to cause a significant change in the tourism ratio. However,
tourism gross value added increased by EUR 70 million.
It should be noted, however, that the calculations are experimental and that the additional sources and
new methods of obtaining and estimating data could only be used to a limited extent. Undoubtedly,
they made it possible to obtain higher estimates of tourism expenditure, while in the case of the impact
of tourism on the main macro-aggregates of the Polish economy they did not cause significant changes.
Moreover, Poland is the country for which the tourism ratio published in the above-mentioned report
(Tourism Satellite Accounts in Europe, 2019 edition) was the lowest among the project partners; in
countries with a larger tourism sector, the increase in the value of expenditure estimated with the
project achievements, and the resulting change in the tourism ratio, would surely be greater.
The results obtained from these experimental calculations come as no surprise. The time constraints
of the project allowed new estimates to be prepared only for the totals relevant to determining
tourism demand in the TSA, and no significant changes in the total value of tourism expenditure are
expected, even if the proposed methods are continued and improved. The data collected with the
methods and tools used so far are reliable; the aim is, above all, to improve their quality and
completeness, as well as to identify and measure new, hitherto unmeasured phenomena in tourism.
Measures to reduce the response burden should also not be forgotten.
The project partners emphasize, however, that further development is important not only to improve
the basic tourism statistics that feed the TSA tables; in many cases, even more important than the
total value is the expected improvement in the quality of estimates for specific expenditure items
that are difficult to estimate from sample survey data. Italy, for example, points to the
underestimation of home rental expenditure in domestic tourism, and the problem of second homes
increasingly affects inbound tourism as well. This issue concerns not only the occasional provision of
such services among family or friends, but also, on an increasingly serious scale, services provided
within the so-called collaborative economy. In the near future, data on short-term accommodation
from four large international platforms will become available via Eurostat, which should significantly
improve the estimates of this phenomenon. Nevertheless, the method proposed in the project for
web scraping other sites devoted to renting such houses will, among other things, help minimize the
gaps between tourism statistics and national accounts data that currently result from the use of
sample survey data. The scope of current web scraping of tourism-related websites keeps expanding
and will also make it possible to measure other phenomena that are difficult to estimate but highly
relevant to the development of the TSA, such as the rental of passenger cars without a driver by
tourists or the sales volume of conference services. Above all, however, it will enable a more
accurate estimation of the value of services purchased by tourists as packages (accommodation,
catering, transport, recreation, etc.), which, in accordance with the TSA methodology, must be
disaggregated into the relevant tourism expenditure items.
3. Implementation of Tourism Integration and Monitoring System
prototype
Deliverable J4 - Technical Report presents the concept of the Tourism Integration and Monitoring
System (TIMS) prototype along with dedicated micro-services, which would support statistical
production in the area of tourism statistics and assist in monitoring changes in the tourism sector.
That report describes all the important components that such a system should contain.
Additionally, this chapter presents an example of the implementation of the main components of the
prototype system developed as part of the WPJ work.
Login screen
It is the first element of user interaction with any system and should therefore be designed in a way
that ensures easy and quick access to the application. Login screens are often overloaded with
options and checkboxes, so a few good practices should be followed when designing one:
distraction-free interface
– the interface should not contain distracting graphic elements or animations. The best practice
is to use a uniform background and to minimize the amount of information the user has to
enter.
easy-to-fill-out form
– the number of fields should be limited to the necessary two. The first field takes the login
(e-mail address or username); the second, the password, should be designed so that it is easy
to complete. In particular, there should be a built-in option to show the password while typing
it. The option “Forgot your password?”, enabling the user to recover a lost or forgotten
password, should also be provided.
clear error descriptions
– all information about errors that occurred during login should be presented clearly, so the
user understands the situation. Vague phrases like “Login Error” should therefore be avoided
in favour of information about the type of error, such as “Invalid e-mail entered”.
When programming a system that will be used in various European Union countries, it is worth
remembering to add the option that allows the user to select the language.
An exemplary TIMS login screen is presented in Figure 15.
Figure 15. TIMS login screen
Action centre
This is an important design element that provides space for the application identity and facilitates
access to actions performed in the system. It gives the user access to alerts and the message centre,
which inform, in a simple and transparent manner, about the most important operations and events
in the application (see Figure 16).
Figure 16. Action centre
Navigation panel
This element integrates the entire application framework, turning fragmented content into a
coherent, structured whole. All the most important functionalities are located in one place, to which
the user has constant access. A well-designed navigation panel presents the possibilities of working
in the system clearly; to achieve this, the core of the application needs to be emphasized, while
secondary functions are appropriately controlled and hidden. The proposed approach is presented
in Figure 17.
Figure 17. Navigation panel
Dashboard view
It is the main view (see Figure 18) which quickly presents dynamic reports generated on the basis of
data stored in the system. Each user should be able to adjust the presented information to their own
needs and analyses. From the dashboard the user can navigate to details to get more information
about the selected data item.
Figure 18. Dashboard view
Sample views for the key system functionalities are presented below.
Web scraping
The first one, the web scraping view (see Figure 19), contains information on scripts for downloading
data from web portals. It allows the user to check the details of the selected script (see Figure 20) and
download its code in the form of text or a file (see Figure 21).
Figure 19. Web scraping view
Details for individual scripts contain basic information regarding the portal name, frequency of data
retrieval, and the date of the last run of the process.
Figure 20. Web scraping details
The code of scripts prepared in any programming language can be presented in text form directly in
the system window (see Figure 21). In addition, it is also possible to download the source code as a file
and save it to a local disk for customization.
Figure 21. Script code view
Micro-services
The J4 report describes exemplary micro-services that add new functionalities to the TIMS system.
Each micro-service can be prepared in any programming language, by any country dealing with
tourism statistics that has access to the application. The proposed view (Figure 22) lets users easily
browse the services developed for tourism statistics. It presents all micro-services available at a
given moment, along with their descriptions, options and the requirements necessary to run them,
and it also allows processes to be started and executed from this view.
Figure 22. Micro-services view
An exemplary interface of a launched micro-service, used to obtain geolocation data for
accommodation establishments from their address data, is shown in Figure 23. By default, a map
view centred on the user’s location is presented, with a drag-and-drop window where files with data
for conversion can be added.
Figure 23. HERE Maps geolocation tool
The exemplary implementation of the TIMS prototype presented in this chapter and described in
detail in deliverable J4 may serve as the basis for a system that could become a key element in
creating tourism statistics in all European Union countries.
4. Case studies
The following case studies present the results of the individual tasks performed in each partner
country. This chapter consists of 8 subchapters dedicated to the results achieved in the following tasks:
1. Web scraping.
2. Source characteristics.
3. Legal aspects.
4. Combining data.
5. Spatial-temporal disaggregation of data.
6. Flash estimates of the occupancy of accommodation establishments.
7. Methodology to improve the quality of data in various statistical areas.
8. Experimental tourism statistics.
4.1. Web scraping
Use Case Identification
Use Case ID WPJ.1.BG
Use Case Name Web scraping
Date of creation 01.01.2019
Use Case Definition
Description: Applying the web scraping method to extract data from Hotels.com for
statistical purposes and testing the Polish software under Bulgarian
conditions.
Preconditions: An IT expert with the knowledge to configure and implement the Polish
software, and an IT operator for the daily execution of the software, are
needed. The Google Chrome browser is required. A subject-matter
statistician is needed to analyse the scraped data with the R script
provided by the Polish team.
Flow:
1. Implementing the Polish software:
Installation
Configuration
Testing
2. Daily execution of the Polish software for accommodation in Bulgaria available on the
Hotels.com portal for three check-in dates: the next day, the next weekend, and the last
Thursday of the month.
3. Storage of the web-scraped data on the big data server in the BNSI IT infrastructure.
4. Analysis of the scraped data with the R script designed by Statistics Poland.
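The three check-in dates driving each daily run can be derived programmatically. A minimal Python sketch, assuming (our assumption, not stated in the use case) that “next weekend” means the coming Saturday and that the last Thursday is taken within the current calendar month:

```python
# Sketch: derive the three check-in dates for a given run date.
# Assumptions: "next weekend" = the coming Saturday (a week ahead if run on
# a Saturday); "last Thursday of month" = last Thursday of the run month.
import calendar
from datetime import date, timedelta

def checkin_dates(run_date: date) -> dict:
    next_day = run_date + timedelta(days=1)
    # days until the coming Saturday (weekday 5); a full week if run on Saturday
    days_to_sat = (5 - run_date.weekday()) % 7 or 7
    next_weekend = run_date + timedelta(days=days_to_sat)
    # walk back from the last day of the month to the last Thursday (weekday 3)
    last_dom = date(run_date.year, run_date.month,
                    calendar.monthrange(run_date.year, run_date.month)[1])
    last_thursday = last_dom - timedelta(days=(last_dom.weekday() - 3) % 7)
    return {"next_day": next_day, "next_weekend": next_weekend,
            "last_thursday": last_thursday}

print(checkin_dates(date(2020, 9, 14)))  # run on a Monday
```

Such a helper makes the manual daily execution reproducible regardless of the day the operator starts the software.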
Issues/Exceptions:
There is no Bulgarian version of Hotels.com, so the German version was used. The software is
executed manually, which means data for weekends and holidays are unavailable. Changes on the
Hotels.com website require time to adjust the software.
Output summary:
CSV files with scraped accommodation data, which are subsequently analysed with the Polish R script.
Estimates on scraped data for Hotels.com:
Prices with density.
Mean and median prices by accommodation type.
Time series with mean and median prices.
Monthly descriptive statistics at national level by accommodation type.
Offers by accommodation types.
Use Case Identification
Use Case ID WPJ.1.DE-Hesse
Use Case Name Web scraping
Date of creation 30.09.2020
Use Case Definition
Description: Examining the possibility of using web scraping techniques to extract
data from internet sites for booking and reviewing touristic
accommodations (e.g. Hotels.com or Booking.com). Hesse used the tools
provided by WPJ (Statistics Poland), but also maintains its own tools for
the same or different portals. In total, during the project term, Hesse
scraped around 10 different portals, 5 of them on a monthly basis.
Preconditions: The legal situation of web scraping, in general and for official statistics,
must be checked (see use case WPJ.3.DE-Hesse on legal aspects).
Depending on the implementation status in production systems, more or
less reliable and powerful IT infrastructure must be available. For a pilot
study, a normally equipped system was sufficient; for production, more
reliable systems are needed.
However, some technical preconditions have to be met:
First, unrestricted access to the internet is needed (at least
unrestricted access to the specific sites; this has to be checked
carefully, since many different servers and domains often have to
be contacted for a website to work properly or as expected).
Second, internal IT security conditions have to be checked and
met, e.g. unrestricted internet access and the installation and use
of non-standard programs or applications: a specific internet
browser, running unsigned Java applications, executing JavaScript
code, or running non-standard analysis software such as R or
KNIME. This also includes carefully checking and planning the
automatic update procedures for installed software and the
operating system, since these procedures may interfere with
planned scraping processes, or tools may not work as expected
after the executing software has been updated.
For running web scraping tools, staff with at least basic knowledge of
HTTP, HTML, CSS, XPath and data structures (XML, JSON) are needed. For
maintaining tools in different languages (Java, JavaScript, R, KNIME,
Python, …), staff with sound programming knowledge are needed.
Flow:
1. Identify eligible sites (in terms of size/coverage of the target population or of a specific
subgroup, e.g. camping grounds, holiday homes, group accommodations, local coverage, …).
2. Identify the relevant information on the site: what kind of information is presented; different
purposes or aims require different information: functional information (e.g. business size
(number of rooms or beds offered), price for staying one night, …) or rather metadata about
a business that is needed for linking micro data with other data sources (name, address,
geolocation) or for aggregation.
3. Check the terms and conditions, robots.txt and sitemap.xml of the sites identified in step 1
(see use cases WPJ.3.DE-Hesse and WPJ.3.NL on legal aspects).
4. Inspect the site and its technical accessibility, and decide on the scraping technology
(e.g. whether there is access to an API; static pages and dynamically generated pages require
different approaches).
5. Write a script or program to request pages and extract information from the retrieved pages
(different approaches and languages are possible). Hesse used the tools provided by WPJ
(Java, JavaScript) as well as its own scripts written in R or workflows built in/for the KNIME
software.
6. Provide output for further processing or analyses (micro data or aggregated data, e.g. a CSV
file).
7. Further processing may include enriching the scraped data with geolocations (pairs of
coordinates) given an address (street name, house number, postcode, city name) or,
vice versa, “reverse geocoding” a geolocation provided on internet sites (resulting in street
name, house number, postcode and city name). WPJ provided a tool for the geocoding task.
Alternative tools for geocoding as well as reverse geocoding have been developed and used
by HSL (making use of openstreetmap.org, i.e. the “Nominatim” service of OSM). However,
internally available official tools capable of geocoding as well as reverse geocoding were
finally used for these tasks.
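The robots.txt check in step 3 can be automated with the Python standard library. The rules below are invented for illustration and do not describe any real portal; in practice the parser would be pointed at the site’s actual `https://<site>/robots.txt`:

```python
# Sketch: checking whether a path may be fetched according to robots.txt,
# using only the standard library. The rules and URLs below are illustrative.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /admin/
Allow: /hotels/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("my-statistics-bot", "https://example.org/hotels/berlin"))  # True
print(rp.can_fetch("my-statistics-bot", "https://example.org/admin/users"))    # False
print(rp.crawl_delay("my-statistics-bot"))  # 10
```

Honouring the declared crawl delay between requests also reduces the risk of the IP blocking described below.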
Issues/Exceptions:
Hesse used the tools developed and provided by the Polish colleagues (a JavaScript tool for scraping
Hotels.com and a Java tool for scraping Booking.com), complemented by its own tools where
necessary.
Tool for Hotels.com:
To work properly, the tool needs a list of city or region names and a portal-internal Hotels.com ID
for each name. Therefore, in a first separate step (Figure 24), a list had to be compiled that included
all 720 unique city-name and ID combinations for the 430 Hessian cities, based on the official register
of communities and cities. (Obviously, neither IDs nor city names are unique in this list, so there will
definitely be some duplicate units after scraping.) For complete coverage, it turned out to be crucial
to include all “duplicate” cities and IDs that form a unique combination. Own experiments, as well
as results reported by colleagues from other countries, showed that the number of resulting units
after scraping depends heavily on the level (“country”, “county”, “region”, “city”) chosen for
scraping; this may even be country specific.
Figure 24. Schematic KNIME workflow to identify all valid combinations of city name and portal-internally used ID
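Because the same establishment can be returned under several (city, ID) search combinations, the duplicate units mentioned above have to be removed after scraping. A minimal sketch keyed on the portal-internal hotel ID (field names are illustrative, not the actual tool’s schema):

```python
# Sketch: de-duplicate scraped units on the portal-internal hotel ID,
# keeping the first occurrence. Record fields are illustrative.

def dedupe_units(records):
    seen, unique = set(), []
    for rec in records:
        if rec["hotel_id"] not in seen:
            seen.add(rec["hotel_id"])
            unique.append(rec)
    return unique

scraped = [
    {"hotel_id": 101, "city": "Frankfurt am Main", "price": 89.0},
    {"hotel_id": 102, "city": "Frankfurt am Main", "price": 119.0},
    {"hotel_id": 101, "city": "Frankfurt (Region)", "price": 89.0},  # duplicate
]
print(len(dedupe_units(scraped)))  # 2
```

In the KNIME workflow the equivalent step is a group-by/duplicate-row filter on the same key.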
The tool ran fine for some weeks. On some days, access to the site was blocked (HTTP error 403)
during the scraping process. For several months now, access to the site has been blocked even for
the first HTTP request issued by the JavaScript file. Using a different IP address or user-agent is not
an option for the HSL. Increasing the waiting times between page requests did not solve the problem,
as the waiting times were already quite long. Therefore, only a rather small daily time series is
available for analysis for Hesse.
Tool for Booking.com:
The tool provided by the Polish colleagues has been used weekly to download all pages for
accommodations in Hesse. Initially, there were some problems related to the types of
accommodation available in the target regions that caused the script to break. Now, a weekly run
scrapes all presented accommodations for the target regions (with many duplicate results expected).
A KNIME workflow has been compiled to extract information from the downloaded HTML pages.
However, there has been no further progress in using the scraped data from Booking.com.
Output summary:
Since there are large gaps in the time series for the mainly used portal (due to temporary and,
finally, long-term blocking of the IP address), there is unfortunately no meaningful way of presenting
and interpreting results, or of producing data that are a useful input for temporal or spatial
disaggregation and flash estimates. Therefore, only a few results and figures are shown for the
output of the WPJ method (daily scraping of offers for a specific date; JavaScript file to capture
dynamic content) and of an independently used alternative method (HSL only; R script; monthly
scraping; no specific dates; no dynamic content). Figure 25 shows the available time series, which
captures the variability in prices; the large gaps are due to the blocked IP, which eventually led to
a final stop.
Figure 25. Time-series for accommodation booking portal (WPJ method): mean and median prices
Interestingly, Figure 26 shows less variability (but the overall picture is misleading due to scaling and
the large gaps).
Figure 26. Time-series for accommodation booking portal (WPJ method): number of offers
The alternative approach of scraping the same portal monthly with a different method cannot
substitute for the daily scraping, since neither the variability in the number of offering units nor the
variability in prices is captured by this method.
In Figure 27, the increasing numbers are not related to real changes in the population but to changes
in the program used for scraping. Additionally, the gaps visible in the middle of Figure 27 are not
related to the corona crisis but to temporary failures of the program (temporary blocking of the IP).
Figure 27. Time-series for accommodation booking portal (alt. method): (monthly) number of unique units
Figure 28 shows that with the alternative method only very small variability in prices is visible in
the time series.
Figure 28. Time-series for accommodation booking portal (alt. method): mean and median prices (log scale)
Use Case Identification
Use Case ID WPJ.1.IT
Use Case Name Web scraping
Date of creation 10.09.2020
Use Case Definition
Description: Examining the possibility of using web scraping techniques to extract
data from Booking.com portal.
Preconditions: IT programmer with good knowledge of the web environment. In
particular with knowledge of Java programming language, Selenium
library API, HTML, CSS, and JavaScript.
Flow:
1. Inspection and analysis of the Booking.com site. The municipality is used as the search
destination parameter. Study of the results page to optimize the extraction of the hotel output
information (name, ID, type, address, Atlas geo-coordinates). Web developer tools installed in
the Firefox browser were used for this. A solution was sought to overcome the limit of 1 000
viewable results by using more selective search criteria.
2. Design and development of the Java application for the acquisition of structure information.
Preparation of the input file containing the list of municipalities to be set as destination search
criteria. If there are more than 1 000 output results, the application automatically applies
further filtering criteria (area, type of structure, number of stars).
3. Generation of output files in CSV format. The first file contains the search criteria, the number
of structures obtained, and the search URL, useful for a possible restart of the application. The
other file contains structure information such as name, type, identifier, address, Atlas
geographic coordinates, and a detail URL that may be useful for a subsequent, more in-depth
investigation. The second file can be used in a following scraping run to verify the availability
of the structure for the dates and number of people set as the search criteria. The application
implements code to handle bot detection.
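The automatic refinement in step 2 — splitting a query that exceeds the 1 000-result display limit by adding filter criteria — can be sketched as a recursive procedure. The filter names and the counting function below are illustrative stand-ins for real portal requests, not the Java application’s actual code:

```python
# Sketch: if a destination query hits the portal's display cap, split it by
# additional filters (here an illustrative star-rating filter) until every
# sub-query is below the cap. `count_results` stands in for a portal request.

CAP = 1000

def split_queries(query, count_results, extra_filters):
    """Return a list of queries whose result counts are all below the cap."""
    if count_results(query) < CAP or not extra_filters:
        return [query]
    name, values = extra_filters[0]
    refined = []
    for value in values:
        sub = dict(query, **{name: value})
        refined.extend(split_queries(sub, count_results, extra_filters[1:]))
    return refined

# Toy count: the whole municipality exceeds the cap, each star class does not.
def fake_count(q):
    return 2400 if "stars" not in q else 480

queries = split_queries({"destination": "Bologna"}, fake_count,
                        [("stars", [1, 2, 3, 4, 5])])
print(len(queries))  # 5 sub-queries, one per star rating
```

Exhausting the filter list without getting under the cap is the signal that a finer criterion (e.g. area) is still needed.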
Issues/Exceptions:
Adapting the software in the case of changes in the structure of the website.
A significant initial effort for adding all the specific search parameters to cover the entire territory.
The preliminary identification and specification of all provinces and municipalities to be searched
is essential to be almost sure that some of the available accommodation data are not missing in
the results.
Output summary:
After an initial attempt on the Hotels.com portal, web scraping was performed on the Booking.com
portal, as it was verified that it contains more structures on the Italian territory. The first web scraping
was carried out in November 2019 at the national level. The high number of structures in the country
required limiting the search to a region considered particularly interesting from a tourist point of view
and which shows a high degree of reliability in the statistical coverage of the census survey: Emilia-
Romagna. Web scraping on this region was performed in two different periods: in November 2019 and
in May 2020.
The number of listings found by web scraping (6,419) corresponds to about half of the structures
contained in the Emilia-Romagna regional archive.
A first analysis of the variable HotelType shows a critical issue concerning the different name strings
of the structures, which must be addressed to interpret the outcome of the extraction from Booking.
The variety of denominations in the regional archive is vast, but it probably reflects administrative
classification needs. A similarly great variety is present on Booking, whose names, however, are
probably more oriented towards attracting tourists, and therefore more distinctive and more
descriptive of the type of accommodation. Some denominations are the same in the two sources
(e.g. hotel, bed & breakfast, camping grounds, etc.); however, some categories, for example other
private accommodations, do not appear at all in the Booking listings for the Emilia-Romagna region,
although they are probably included in other types. On the other hand, accommodations like alpine
refuges and hiking refuges are not covered at all, and it seems unlikely that they can be traced in
other categories. A number of types appear only on Booking, among them chalets, boats, inns,
lodges, motels, resorts, villas and homestay accommodations; only a more in-depth analysis, based
on matching or linking procedures between the structures from the two sources, will be able to
ascertain whether they are included in other denominations.
Based on the identified categories, it emerges that, at the time of the scraping, the potential
coverage of Booking in the Emilia-Romagna region extended to all the touristic villages (100%), most
of the rented houses (78.7%) and a high percentage of residences (67.7%). In contrast, the least
visible categories on Booking in the Emilia-Romagna region were campsites (13.0%), hostels (24.3%)
and rented rooms (29.9%).
Figure 29. Output of web scraping on Booking.com (.CSV file)
Use Case Identification
Use Case ID WPJ.1.NL
Use Case Name Web scraping
Date of creation 26.07.2019. (Last update 30.09.2020)
Design scope: (sub)system for which most specifications are a “white-box”7.
Use Case Definition
Description Use scripts (hotels_com.html and hotels_com.bat) to capture, select and save
data on Dutch accommodations in order to build an Accommodation
Enterprises Database over time.
Preconditions 1. The user has some basic knowledge of HTML, CSS and Java.
2. The user can modify the scripts hotels_com.html and hotels_com.bat for
their own country (e.g. user-agent authorization and working directory).
3. The user selects the best approach to generate a list of location codes
using the Hotels.com API.
4. The user tunes the time-out specifications for starting the browser and
fetching pages.
5. The user installs the Chrome driver. The scripts require the Chrome driver
and the Chrome browser.
6. The IT facility has to be directly connected to the internet and must have
a tool installed to clean the facility (daily) of cookies, malware and other
junk files connected to web scraping using the open internet and the
Windows OS.
Flow:
1. Open the Command Prompt and change to the directory containing the scripts
hotels_com.html and hotels_com.bat.
2. Run the bat file(s).
3. Check that the browser starts.
4. Check whether the script runs over the list of locations provided for the three events (day
after, next weekend and last day of the month).
5. Check whether the CSV file is written to the hard disk.
6. Deploy all CSV files, along with a copy of the *.html and *.bat scripts used during web
scraping, to the production directory.
Issues/Exceptions:
1. The web scraper is detected. Run the facility cleaner. Restart the program at the next location.
At the end of the scraping cycle, retry the locations that caused errors, based on the logging.
(If needed, restart the facility.)
2. (Temporary) changes in the structure of the website. These occur particularly at the
beginning/end of the month or, occasionally, due to a specific event such as changes in
national or provincial policies or legislation (e.g. the corona pandemic). Another possible
explanation is that the location list is temporarily out of order. Try point 1 and/or wait until
the next day.
7 Three hierarchical categories from Cockburn’s design scope were used: (sub)system black-box, (sub)system white-box, and component. [Cockburn, A. (2001). Writing Effective Use Cases. Addison-Wesley. ISBN 0-201-70225-8. OCLC 44046973.]
3. (Permanent) changes in the structure of the website, e.g. the site “nl.Hotels.com” is used
instead of “Hotels.com”, or the specifications of a variable are changed, e.g. the portal
Hotels.com allows accommodations to advertise without requiring their address [fixed by the
Polish team in Oct 2019].
Output summary:
Building an Accommodation Enterprises Database over time has been a long-expected product for
Statistics Netherlands (SN). In the context of the ESSnet WPJ on the Pilot Track Innovative Tourism
Statistics, it has to be acknowledged that this goal was reached under the leadership of Statistics
Poland.
SN started scraping the portal Hotels.com on 24 July 2019. This version of the script was improved
at the end of July thanks to a suggestion made by the Statistical Office of Hessen; in this manner,
the longitude and latitude coordinates8 of all accommodations advertising on Hotels.com are also
scraped.
By the end of September 2020, the Accommodation Enterprises Database contains 3.8 million records
over the period 24.07.2019 until 30.09.2020. This time series provides information per
accommodation, namely name, address, postal code, accommodation type, price per night, offer
identification, hotel identification, destination name, date scraping, scraping type, latitude and
longitude coordinates, region, number of guest reviews, (accommodation) star rating and guest
reviews rating.
From the statistical point of view, the most important variables are the accommodation type
(accType) and the derived variable “period open”. These are very important features because they
provide a direct way to determine when an accommodation is economically active and how the
accommodation owners (i.e. the supply side) classify their own accommodation. Consequently, SN
can compare and match the owners’ classification to the standard NACE classification, i.e. NACE 55.1
Hotels and similar accommodation; NACE 55.2 Holiday and other short-stay accommodation;
NACE 55.3 Camping grounds, recreational vehicle parks and trailer parks; or NACE 55.9 Other
accommodation.
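Such a concordance between the scraped accType and NACE division 55 can be kept as a simple lookup table. The labels and assignments below are illustrative assumptions for the example, not the official SN mapping:

```python
# Illustrative sketch: mapping a scraped accommodation type (accType) to a
# NACE Rev. 2 division 55 class. Labels and assignments are assumptions,
# not an official concordance.

ACCTYPE_TO_NACE = {
    "Hotel": "55.1",
    "Apart-hotel": "55.1",
    "Bed and breakfast": "55.1",
    "Apartment": "55.2",
    "Cottage": "55.2",
    "Hostel": "55.2",
    "Campsite": "55.3",
}

def nace_class(acc_type):
    # fall back to 55.9 "Other accommodation" for unmapped types
    return ACCTYPE_TO_NACE.get(acc_type, "55.9")

print(nace_class("Hotel"))      # 55.1
print(nace_class("Campsite"))   # 55.3
print(nace_class("Houseboat"))  # 55.9 (fallback)
```

Keeping the mapping as data rather than code makes it easy to revise when new accType values, such as the “Chalets” discussed below, appear on the portal.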
Moreover, these web scraping data provide daily insights into whether an accommodation might be
operating in a given month, supporting the Year Inventory of Accommodation enterprises and, to
a certain degree, the Monthly Survey on Number of Guests and Number of Overnights.
The main result is an overview of the variable accType and the number of accommodations using
the Hotels.com portal over time (see Figure 31). These data enable the analysts to select the most
relevant types of accommodation, i.e. hotel, hostel, apartment, bed and breakfast, and apart-hotel
(NACE 55.1). Furthermore, the analyst may decide to discard accommodations of the types
“Cabin/Lodge” and “Cottage”, since they appear only sporadically.
Another relevant result of the Accommodation Enterprises Database is the detection of possible new
forms of accommodation. That is the case of the “Chalets” that appeared soon after the corona
pandemic lock-down. A probable explanation is that these accommodations comply relatively easily
with the policies to control corona spreading: guests (households) share a common space and keep
social distance (and minimum contact) with respect to the accommodation owners and other guests.
8 See also use case WPJ.4.NL.
This database also provides an overview of how severe the financial impact of the corona pandemic
has been on the accommodation industry (at least for the accommodations using the Hotels.com
portal). Figure 32 shows the cloud of points and the smoothed9 line representing the development
of accommodation prices per night. Notice the massive drop in the (mean and median)
accommodation prices after the corona lock-down announcement in the Netherlands after 15 March,
and that the size of the drop is connected to the accommodation type.
Finally, this database can be used to connect observations on the price behaviour of accommodations
to the time lines of economic phenomena. It makes it possible to identify turning points probably
connected to policy measures, e.g. during the development of the corona pandemic in the Netherlands.
9 The smoothed line is obtained using the loess method (locally estimated scatterplot smoothing, with span = 0.25).
Figure 31. Accommodation Enterprises Database: number of accommodations per accommodation type in period 24.07.2019 until 30.09.2020
Figure 32. Accommodation Enterprises Database: Mean (Smoothed) Prices per accommodation type in period 24.07.2019 until 30.09.2020
Figure 33. Accommodation Enterprises Database: Number of offers and (Mean/Median) Prices along with time line of policy measures due to pandemic crisis in the Netherlands
Use Case Identification
Use Case ID WPJ.1.PT
Use Case Name Web scraping
Date of creation 01.10.2020
(Testing began on 2019-08-29, but daily regular scraping began on
2019-10-01)
Use Case Definition
Description: see below
Preconditions: The script for scraping the Hotels.com portal is written in the JavaScript
programming language, and adapting it to suit the needs of the users,
namely regarding locations and check-in dates, is very simple. Nevertheless,
the user must have at least basic programming skills (in any language) in
order to do it properly.
Description:
The script for web scraping the portal Hotels.com developed for WPJ provided the opportunity to
experiment with extended data collection from a big data source in a relevant domain such as tourism.
The use of tourism-related portals by consumers when making travel arrangements is a widespread
practice; therefore, the opportunity to collect and analyse these data is invaluable.
The script does not require a high level of expertise to be tailored to the user’s needs. There are some
notable features of this script:
Easy to adapt and implement;
Returns extensive, relevant and detailed data in a standard format (CSV);
Among other uses, it has significant potential to improve tourist accommodation survey base.
Web scraping of tourism-related portals has been regarded as highly promising for expanding the
resources available for official statistics. Nevertheless, some concerns have been raised in the past,
namely about the ranking of offers and personalised pricing in the travel sector.
A recent report10 (2018) produced under the EU Consumer Programme - “Consumer market study on
online market segmentation through personalised pricing / offer in the European Union” – addressed
this practice of personalised ranking of offers and pricing. Although no evidence of generally
consistent and systematic personalised pricing was found, some evidence was found specifically in
the tourism domain:
“Airline and hotel booking websites showed relatively higher evidence of price personalization compared
to websites selling TVs and shoes.” (p. 260)
The report also noted that the lack of widespread evidence of personalised pricing should be
interpreted with care, since (1) the sample used may not be representative of the EU e-commerce
market as a whole and (2) “online firms may employ any of the latest sophisticated algorithms or
personalisation tools (such as for example digital fingerprinting) which research tools or methodologies
cannot easily detect” (p. 261).
10 Consumer market study on online market segmentation through personalised pricing/offer in the European Union
The problem of personalised ranking is addressed by the web scraper for Hotels.com, as it collects all
the available results. This is crucial because the Hotels.com website Terms and Conditions explicitly
state in the “How we display products and services on Hotels.com” section:
“(…) In our default sort of “Our Favourites”, the sort order reflects the relevance of properties to your
search criteria, as we want to make sure you are able to quickly and easily find the offer that is right
for you. We measure relevance by taking into account factors like a property’s location, its review
scores, the popularity of the property (measured by how many travellers on our sites make bookings at
that property), the quality of the content provided by the property, and the competitiveness of the
property’s rates and availability, all relative to other properties meeting your chosen search criteria.
The compensation which a property pays us for bookings made through our sites is also a factor for the
relative ranking of properties with similar offers, based on the relevance factors described above. On
our non-default sorts (e.g., by price or by star rating), properties with similar results will be ordered
based on the factors above.”
This excerpt states the criteria used for the ranking, but it is not clear how exactly things are done.
Later on, the “Terms and Conditions” state that the algorithm may be subject to change over time and
therefore has a dynamic nature.
“Additionally, Hotels.com continually optimizes our service to provide the best experience to travellers.
Accordingly, we may test different default sort order algorithms from time to time.”
Scraping only the first offers provided by the website would therefore return a result filtered by
criteria that are not entirely known and potentially subject to change over time. Using the Hotels.com
scraper for an extended period of time should contribute to minimizing the eventual impact of such
ranking and/or changes in the algorithm. Therefore, long-term scraping should be a precondition for
using this data source.
Tests were performed in order to have objective criteria for deciding the appropriate list of locations
to scrape. Data were therefore collected for all the 308 municipalities of mainland Portugal as well as
the Azores and Madeira archipelagos.
Statistics Portugal performed daily web scraping of the Hotels.com portal with the script provided by
the Polish team. The data collected cover the period from October 2019 to August 2020, when the
script was permanently blocked (HTTP error 403).
The script was scheduled to run at 00:55 every day on a desktop computer, and it took about
2h30-3h00 to collect an average of 13 343 daily offers from 12 191 unique hotelIds.
Flow:
The flow for this use case comprises selecting the location for web scraping, data collection,
exploratory data analysis and data pre-processing.
1. Selecting location for web scraping
There were several options on how to define the appropriate location search when using the Hotels.com
script. One option could be to select a particularly well-known touristic region, but such regions do not
always match the administrative regions used in official statistics. Tests were conducted in order to
choose the appropriate locationArray list to scrape. Three separate hotels_com.html files were
prepared and their results compared:
1. The first with locationArray {id: “10233141”, name: “Portugal”}, one search for the entire
country.
2. The second with a locationArray of the 18 municipalities comprising the Lisbon Metropolitan Area
(NUTS3).
3. Finally with a locationArray for all the 308 municipalities existing in Portugal.
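The three tested configurations can be sketched as JavaScript objects in the style the script already uses. This is only an illustration: the country id "10233141" comes from the text above, but the municipality ids and the helper function are invented here, not taken from the actual script.

```javascript
// (1) One search for the entire country (id taken from the text above).
const locationArrayCountry = [
  { id: "10233141", name: "Portugal" }
];

// (2) The municipalities of the Lisbon Metropolitan Area.
// The ids below are invented placeholders for illustration only.
const locationArrayLMA = [
  { id: "1707807", name: "Lisbon" },
  { id: "9999991", name: "Cascais" },
  // ... the remaining 16 municipalities would be listed the same way
];

// (3) All 308 municipalities would follow the same pattern.
// Each entry triggers one search request, so granularity has a direct cost:
function countSearches(locationArray) {
  return locationArray.length; // one search per entry
}
```

The trade-off discussed below (country-level vs. municipality-level searches returning different case counts) is exactly the choice between configurations of this kind.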
The results of the three files were then compared for locality == “Lisbon” only and revealed different
total numbers of cases: (1) had 579 cases, (2) had 728 and (3) had 707 cases. The differences found
were not only in the total number of cases for locality == “Lisbon”. When comparing (2), the Lisbon
Metropolitan Area, with (3), the 308 Portuguese municipalities, the data revealed that 27 cases were
missing and 48 were added: some hotels were found in the Lisbon Metropolitan Area search results (2)
but not in the results for the 308 municipalities (3), and vice-versa. These results seem to imply that the
“level” used for the locationArray variable has an impact on what and how many cases are collected.
Not being able to fully understand these discrepancies, a decision was made to scrape the largest
possible amount of data at the most detailed granularity possible. This is particularly relevant if
improvement of the touristic accommodation base is intended. Hence, the Hotels.com script was
adapted to collect data from all the 308 existing Portuguese municipalities.
2. Data collection
The minimum number of offers collected on a given day was 5 098 (2020-05-08) and the maximum was
172 103 (2020-03-29). During the period from 2019-10-01 until 2020-08-12 the script was active and
collected over 3.7 million offers, being successful on 282 days. The remaining 11% of days, on which no
data were collected, are mainly due to changes in the portal that required the script to be adapted, or
to internal IT maintenance that required the computer where the script was running to be rebooted
(mostly during weekends).
The Hotels.com script has mainly three parameters that can be adapted according to user needs:
Location (locationArray), which was already discussed above,
Check-in date (scrapingType): last Thursday of month, next day and next weekend,
Number of guests (adults): 1 or 2 adults and 0 children.
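The three scrapingType check-in dates can be sketched as small date helpers. This is a hypothetical reconstruction, not the WPJ script itself; in particular, "next weekend" is assumed here to mean the next Saturday, which the report does not specify.

```javascript
// Check-in date helpers: a sketch of how the three scrapingType values
// could be derived from a reference date (assumptions noted in comments).

function nextDay(d) {
  const r = new Date(d);
  r.setDate(r.getDate() + 1); // Date handles month/year rollover
  return r;
}

function nextWeekend(d) {
  // Assumption: "next weekend" = the next Saturday (getDay() === 6).
  const r = new Date(d);
  const daysToSaturday = (6 - r.getDay() + 7) % 7 || 7; // at least 1 day ahead
  r.setDate(r.getDate() + daysToSaturday);
  return r;
}

function lastThursdayOfMonth(d) {
  // Start from the last day of the month, walk back to a Thursday (getDay() === 4).
  const r = new Date(d.getFullYear(), d.getMonth() + 1, 0);
  while (r.getDay() !== 4) r.setDate(r.getDate() - 1);
  return r;
}
```

For example, for the reference date 2020-08-12 (the day the Portuguese scraping stopped), these helpers give 2020-08-13, 2020-08-15 and 2020-08-27 respectively.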
Check-in Date
In the Portuguese use case, the pre-defined parameters for scraping data were a combination of check-
in date and number of adults:
1 adult, last Thursday of month,
1 adult, next day,
2 adults, next weekend.
Table 26 presents a cross-tabulation of the three check-in dates by the number of adults.
Table 26. scrapingType by adults
scrapingType             1 adult      2 adults
last Thursday of month   1 731 731    0
next day                 1 010 930    0
next weekend             0            1 020 039
As expected, these parameters returned duplicate hotelIds. Since the data were predominantly intended
to improve the tourist accommodation survey base, duplicate hotelIds were removed in the next phase
and 12 191 unique accommodations were identified.
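The deduplication step can be sketched as follows; a minimal illustration assuming offer records carry a hotelId field, as in the scraper output (the helper name is invented here).

```javascript
// Reduce the offer-level file to one record per unique hotelId,
// keeping the first occurrence of each accommodation.
function uniqueAccommodations(offers) {
  const byId = new Map();
  for (const o of offers) {
    if (!byId.has(o.hotelId)) byId.set(o.hotelId, o);
  }
  return [...byId.values()];
}
```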
Figure 34 represents the total number of offers collected daily for the three above-mentioned check-in
dates over the entire period the web scraping was active.
Offers for “next day” and “next weekend” follow a similar trend and have a similar volume. These
figures are very eloquent in demonstrating the impact the COVID-19 pandemic had on the tourism
domain.
Figure 34. Time series of offers for “next day” and “next weekend”
Offers for “Last Thursday of the month” present extremely high values approximately every 90 days.
Figure 35. Time series of the last Thursday of the month offers
The distribution of offers by check-in date requires some attention. As can be seen in Figure 35, “Offers:
Last Thursday of the Month” does not follow the same trend as “next day” or “next weekend”.
Additionally, not every accommodation provided offers for all three scrapingTypes.
In fact, 5 217 accommodations (42.8%) only had offers for 1 of the 3 requested check-in dates, and
these were mostly for “Last Thursday of the Month”.
Table 27. Number of check-in date options offered by accommodations
Check-in dates n percent
1 5 217 42.8
2 1 441 11.8
3 5 533 45.4
Once outliers are removed, the “Last Thursday of the Month” time series fits a pattern similar to “Next
Day” and “Next Weekend”, as can be confirmed in Figure 36.
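The report does not state which outlier-removal method was used for this series; one common option is an interquartile-range filter, sketched here purely as an illustration.

```javascript
// Hypothetical IQR-based outlier filter (an assumption, not the method
// actually used in the use case): values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
// are discarded.
function quantile(sorted, q) {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos), hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

function removeOutliers(values) {
  const s = [...values].sort((a, b) => a - b);
  const q1 = quantile(s, 0.25), q3 = quantile(s, 0.75);
  const iqr = q3 - q1;
  const lo = q1 - 1.5 * iqr, hi = q3 + 1.5 * iqr;
  return values.filter(v => v >= lo && v <= hi);
}
```

Applied to a daily-offers series, such a filter would trim the extreme ~90-day spikes while leaving the regular level untouched.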
Figure 36. Comparison of time series for different types of offers
3. Exploratory Data Analysis
Web data sources such as the Hotels.com portal naturally do not comply with the same strict protocol
for registering data as official statistics. Additionally, it is not entirely known how the data were
produced, and therefore no assumptions can be made without an adequate exploratory data analysis.
Although a thorough exploratory data analysis was conducted on the entire data set, the present use
case focuses only on the data (variables) to be used in use case WPJ.4b.PT.
The data collection returned search results for all the 308 municipalities (location) and for all check-in
dates (scrapingType) and numbers of guests (adults). The range of offers by location is very wide, with
a maximum for Lisbon of 722 400 offers from 3 489 distinct accommodations and a minimum for Santa
Cruz da Graciosa of 400 offers from only 3 accommodations.
Table 28 shows that Lagos and Funchal have a similar number of offers but a very different number
of accommodations (1 495 for the former and 972 for the latter).
Table 28. Top 5 destinations by number of offers, distinct accommodations
destinationName n n_hoteId
Lisboa 722 400 3 486
Porto 380 692 1 601
Lagos 83 291 1 495
Funchal 82 273 972
Albufeira 78 991 2 004
It is important to note that offers for a destination may not originate exclusively from accommodations
in that particular location. The Hotels.com “Terms and Conditions” do not guarantee that search results
will include exclusively offers from within the location specified by the user.
Table 28 shows that Funchal has approximately the same number of offers as Lagos but with 35%
fewer accommodations. Funchal being located on Madeira Island may limit the range of offers from
other locations (and other accommodations) that the Hotels.com algorithm is able to include in the
results.
Table 29. Bottom 5 destinations by number of offers, distinct accommodations
destinationName n n_hoteId
Proença-a-Nova 1 535 60
Ilha do Corvo 1 290 5
Santa Cruz das Flores 954 5
Lajes das Flores 817 5
Santa Cruz da Graciosa 400 3
The scraper provides a very useful “region” variable, but the data are registered in an inconsistent
format, varying in language, abbreviation and reference scope (district, region or area). For example,
for Lisbon-related accommodations 9 different designations can be found. Additionally, this variable
has 71.8% missing values.
Table 30. Multiple names for Lisbon region
region n percent
LIS 1 873 0.8
lisboa 1 116 0.5
Lisboa 154 297 65.1
Lisbon 75 622 31.9
Lisbon Area 65 0.0
Lisbon District 621 0.3
Lisbon Region 122 0.1
LISBON REGION 135 0.1
Region Lizbony 3 155 1.3
Some inconsistency is also found in addresses, where the varying use of spaces, capitalization,
abbreviations and language, among other things, calls for standardization. Table 31 shows nine
different ways of registering an address on the same street.
Table 31. Different protocols for address (street)
street
Av. da Liberdade, 247
Avenida Liberdade, 185
Avenida Liberdade 243
Avenida da Liberdade 180 B
Avenida da Liberdade Nº 177 4º Esquerdo
Avenida da Liberdade, n 204, 3 esquerdo
Avenida da Liberdade, 138-142
Avenida da Liberdade, 177 4DTO
Avenida Liberdade 202, 2nd Floor
Furthermore, it is not easy to determine whether an address is complete, that is, whether it has street,
house number, floor and postal code, because it is not uncommon for an address to lack one of these
elements. For example, the complete address of the Statistics Portugal headquarters does not have a
house number; it is the building’s name that is used for identification. In the data collected by web
scraping, 17% (n = 2 078) of hotelIds do not have a house number in their corresponding address.
The identification variables hotelId and offerId proved to be very robust, as they are in fact unique.
As for hotelName, it was expected that the same hotel name would appear in different localities, but
such cases could easily be identified by cross-referencing hotelId or offerId. Unexpectedly, 5.3% of
hotelIds have two or more hotelNames. Table 32 shows an example of a hotelId associated with
five different names.
Table 32. Example of multiple hotelName’s for the same hotelId and offerId
hotelName hotelId offerId
Joker Guest House 44295775 1418464800
Welcom Lisbon 44295775 1418464800
U INN Lisbon 44295775 1418464800
Joker Guest House Lisbon 44295775 1418464800
Travel Inn Lisbon 44295775 1418464800
Some minor inconsistencies were also found in accommodation names although hotelId proved to be very robust.
Table 33. Example of same hotelId and offerId for different hotelName
hotelName hotelId offerId
A da Avo -The Guesthouse 42101832 1348258624
À da Avó -The Guesthouse 42101832 1348258624
Foreign locations
The search results also included accommodations located abroad, namely in Spain, which accounted
for 6% (n = 729). These locations are mainly from areas near the Portugal/Spain border and were
discarded. Examples can be found in Table 34.
Table 34. Foreign locations returned from search results
hotelName locality street region
Alojamientos Rurales Los Molinos Fuentes de Leon Timoteo Pérez Rubio 16 A Badajoz
Hotel Spa Vilavella A Mezquita Lugar de Vilavella, s/n Ourense
Hospedería El Pico del Fraile Cobreros Carretera De Santa Colomba, 1 Zamora
Hotel Las Bovedas Badajoz Autovía Madrid-Lisboa, km 405.7 Badajoz
Accommodation Type
Accommodation type (accType) is a relevant variable when combining scraped data with survey data.
Almost every case has data in this field; only 0.97% (n = 117) have a missing value.
Hotel and Apartment are by far the most common accommodation types found in the data scraped
from the Hotels.com portal.
Table 35. Accommodation type frequency table
accType n percent valid_percent
Hotel 1 356 464 36.1 36.5
Apartment 837 897 22.3 22.6
Guest House 451 335 12.0 12.2
Hostel 204 423 5.4 5.5
Country House 183 373 4.9 4.9
Bed and breakfast 179 947 4.8 4.8
Apart-hotel 146 020 3.9 3.9
Vacation home Condo 136 324 3.6 3.7
Villa 74 734 2.0 2.0
Cottage 40 597 1.1 1.1
Other 102 261 2.7 2.7
NA 49 325 1.3 -
Even though these categories are similar to the ones used in official statistics, there is no guarantee as
to their degree of accuracy. In some cases, this variable registers not actual types of accommodation
but rather information on the services provided, such as “with wifi” or “All inclusive”. This kind of
appropriation of a particular field to register information other than what it was originally meant for is
also found in hotelName, where long descriptions of the accommodation can be found, for example:
“House With one Bedroom in Águeda, With Wonderful Mountain View, Pool Access and Enclosed
Garden - 55 km From the Beach”.
Based on hotelId, it was possible to identify 4.4% (n = 541) of accommodations with two or more
accommodation types.
Table 36. Accommodations by number of distinct accType values
Number of accTypes n percent
1 11 650 95.6
2 524 4.3
3 17 0.1
Table 37 presents an example of an accommodation establishment that changed its accType three
times during the period in which data were collected.
Table 37. Example: Accommodation accType changes along time
hotelId hotelName accType date
2693234 Arts In Hotel Conde Carvalhal Apart-hotel 2019-10-01
2693234 Arts In Hotel Conde Carvalhal Apart-hotel 2020-01-02
2693234 Arts In Hotel Conde Carvalhal Apartment 2020-01-03
2693234 Arts In Hotel Conde Carvalhal Apartment 2020-07-22
2693234 Arts In Hotel Conde Carvalhal Hotel 2020-07-31
2693234 Arts In Hotel Conde Carvalhal Hotel 2020-08-12
This raises concerns when it comes to selecting a distinct accommodation case whose values (either
name or accommodation type) changed over an extended period of data collection. The options are to
select the most frequent value or to consider only the latest update.
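The two options can be sketched as follows. The field names (accType, date) follow the scraper output; the helper names are invented for illustration, and records are assumed to carry ISO-formatted dates so string comparison orders them correctly.

```javascript
// Option 1: take the value from the most recent record.
function latestValue(records) {
  return [...records]
    .sort((a, b) => a.date.localeCompare(b.date)) // ISO dates sort lexicographically
    .pop().accType;
}

// Option 2: take the most frequent value across all records.
function mostFrequentValue(records) {
  const counts = new Map();
  for (const r of records) counts.set(r.accType, (counts.get(r.accType) || 0) + 1);
  return [...counts.entries()].sort((a, b) => b[1] - a[1])[0][0];
}
```

The two rules can disagree, as for the establishment in Table 37: the latest update says Hotel, while an earlier value may dominate by frequency.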
4. Data Pre-Processing
The preceding exploratory data analysis was valuable for determining the necessary actions for data
cleaning, pre-processing and standardization. This step is of paramount importance, and it should be
performed as extensively and as wisely as possible in order to, as described in Deliverable J3 -
Methodological Framework Report, “make them as similar as possible (thus avoiding False Negatives)
without making them too similar (thus avoiding False Positives)” (p. 21).
Define missing values
Remove duplicate cases
Solve multiple categorizations for the same case
Pre-process string variables to be used for linking data
o Convert to lowercase
o Remove punctuation and accentuation
o Remove words on a curated stop word list
o Trim whitespace and remove double spaces
o Generalize abbreviations, especially for addresses, e.g. praça -> pc or avenida -> av (n=16)
Remove all foreign locations
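The string pre-processing steps above can be sketched as a single normalization function. The stop word and abbreviation lists below are small illustrative subsets invented for this sketch, not the curated lists actually used (n=16 abbreviations).

```javascript
// Illustrative subsets only; the real curated lists are not reproduced here.
const ABBREVIATIONS = [
  [/\bavenida\b/g, "av"],
  [/\bpraca\b/g, "pc"], // matches "praça" after accent removal
];
const STOP_WORDS = ["de", "da", "do"];

function normalizeForLinkage(s) {
  let t = s.toLowerCase();
  t = t.normalize("NFD").replace(/[\u0300-\u036f]/g, ""); // strip accents
  t = t.replace(/[^\w\s]/g, " ");                         // strip punctuation
  t = t.split(/\s+/).filter(w => w && !STOP_WORDS.includes(w)).join(" ");
  for (const [re, abbr] of ABBREVIATIONS) t = t.replace(re, abbr);
  return t.trim();
}
```

Applied to the Table 31 variants, this kind of pipeline maps e.g. "Av. da Liberdade, 247" and "Avenida da Liberdade, 247" to the same normalized key, which is the point of the linkage step.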
Issues/Exceptions:
Blocked script
The script for Hotels.com ran from 2019-10-01 until 2020-08-12 without major interruptions. Then
access to the site was permanently blocked (HTTP error 403) and the script would stop at the very first
location. This was not a general block, as other countries continued to use the script successfully. To
try to overcome this problem, some steps were undertaken:
Verified the location id for just one location (Lisbon) and confirmed that both the id value and
the location name were still valid;
Changed the name of the first location in the script;
Tested several other first entries (other locations);
Increased the timeout in the bat file;
Increased the minimum and maximum timeout between fetching the next pages of Hotels.com.
None of these actions succeeded in resuming the scraping process.
Output summary:
The script developed for WPJ for web scraping the portal Hotels.com collected an average of 13 343
daily offers from 12 191 unique hotelIds. Over a time span of 316 days, the script collected over
3.7 million offers on 282 days before it was permanently blocked.
The Hotels.com script was adapted to collect data from all 308 Portuguese municipalities (Azores and
Madeira archipelagos included), which represents a considerable amount of data.
As expected, data scraped from the web lack the quality found in official statistical data sources in
a number of ways. There are some inconsistency issues, but they are easily identified. Nevertheless, it
is a valuable source when appropriate data cleaning procedures are undertaken.
Exploratory data analysis gave important insight into how to proceed regarding data cleaning,
pre-processing and standardization, which will be determinant for use cases WPJ.4a.PT and WPJ.4b.PT.
Use Case Identification
Use Case ID WPJ.1.SK
Use Case Name Web scraping
Date of creation 04.10.2020
Use Case Definition
Description: Web scraping of the Hotels.com and Booking.com portals.
Preconditions: Scraping codes and suitable browsers installed, regular update
necessary.
Flow:
Scripts provided by the PL team were used to scrape data from the Hotels.com and Booking.com
portals. Both scripts were first tested on Polish data. After a few adjustments, the scripts were launched
successfully on Slovak data. Nevertheless, some problems with both scripts were encountered during
the project. It was decided to web scrape data for the whole of Slovakia at once, and therefore
additional information was linked to the web-scraped data to enable analysis at NUTS3 level at a later
stage.
Hotels.com
Data from the Hotels.com portal were scraped from mid-April 2020. Google Chrome was used
as the browser for web scraping. The initialization search parameters were set to:
destination: Slovakia; adults= 1; children= 0;
With the use of the HERE Maps API tool prepared by the Polish team, variables determining geolocation
were also added to the output file later during the project. The output files with web-scraped data were
well structured. Further work was done to correct Slovak diacritic marks in the output files, because
these were garbled after the data were web scraped. The Hotels.com script resulted in errors a few
times during the project period; for example, the portal changed the structure of its offers, and the
street address of the accommodation establishment was no longer required. This and other problems
were solved by the Polish team or by other colleagues familiar with JavaScript programming. Such
issues can hardly be resolved where there is no support from IT or people skilled in JavaScript. The
web-scraped data were cleaned, combined and analysed in the next steps.
The Polish team also prepared R code for producing basic statistics on the scraped data, including
plots of offers and prices, and histograms. Daily and monthly time series can be created, as well as
a database of all unique scraped accommodations throughout the whole period. The code was
revised and some adjustments were proposed, mainly related to its speed and some standardization.
Below are the plots of the number of offers (Figure 37) and of mean and median prices (Figure 38).
A histogram of prices is also displayed (Figure 39).
Figure 39. Histogram of prices
Booking.com
Data from the Booking.com portal were scraped from the beginning of April 2020. At first, only data
for three major cities representing the west, the centre and the east of Slovakia were web scraped.
The initialization search parameters were set to:
destination: Bratislava, Banska Bystrica, Kosice;
date_from = today;
date_to = today+1;
adults= 1;
children= 0;
rooms= 1;
Later, data for the whole of Slovakia were web scraped, with the destination changed to “Slovakia”.
These time series cover the period from August 2019 to April 2020. The other search parameters were
not changed. As mentioned above, some problems were also encountered while running the
Booking.com script. The main observations are:
The behaviour of the script differed depending on how it was launched: if it was run with a direct
double-click on the jar file, non-existent types caused breakdowns; if it was run from the command
prompt, non-existent types were passed over.
The minimum number of pages scraped for a specific type was 2. Several accommodation types in
Slovakia contained only a few offers (e.g. boats, luxury tents, holiday complexes). The script did not
return results for such cases if the “type” parameter was changed one by one. This was the main
reason why it was decided to web scrape data for all accommodation types at once, using the pipe
sign “|” to separate them.
Scraping the whole set of types at once caused a cumulative (“stock”) effect, whereby the i-th type
was scraped together with all previous types, i.e. if the number of types is n, the first type is scraped
n times, the second one (n-1) times, etc.
A maximum of 40 pages (or 1 000 offers) was available in the Slovak version of Booking.com. As data
were web scraped for the whole of Slovakia at once, some of the available accommodation data were
missed (not an issue for flash estimates and temporal disaggregation, but insufficient for the survey
frame update).
There were discrepancies between the folder name and the accommodation type that should be
stored in that folder. For example, “Hostels” data were stored in the folder “Agritourism farms”, or
“Agritourism farms” data were stored in the folder “Holiday homes”, whereas “Apartments” data were
correctly stored in the folder “Apartments”.
The non-availability of address information in the scraped data is a big disadvantage.
The script stopped working on the 1st of May 2020.
The output files contained the raw scraped Booking.com web pages with the complete underlying
HTML code. R code based on regular expressions was written to extract the necessary data from the
scraped files. Despite frequent changes to the Booking.com page, the main classes and structures of
the pages remained more or less stable, so it was possible to identify key objects and pull the
information from them. After that, structured data frames of daily files were created and used for
further analysis. With the use of parallel programming in R, it was possible to process a daily file in
about 38 seconds (about 700 HTML files, each with ca 12 000 rows, per day).
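The extraction logic, which in this use case was implemented in R, can be illustrated with a simplified JavaScript sketch. The class names below ("sr-hotel__name", "bui-price") are invented placeholders for this illustration, not Booking.com's real markup, and a real extractor would need to handle much messier HTML.

```javascript
// Simplified regex-based extraction of name/price pairs from raw scraped HTML.
// Relies on the page's class names staying stable, as observed in the use case.
function extractOffers(html) {
  const offers = [];
  const re = /<span class="sr-hotel__name">([^<]+)<\/span>[\s\S]*?<span class="bui-price">€\s*(\d+)/g;
  let m;
  while ((m = re.exec(html)) !== null) {
    offers.push({ name: m[1].trim(), price: Number(m[2]) });
  }
  return offers;
}
```

Each matched pair becomes one structured record, which is the same idea as building the daily data frames described above.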
Below are some plots of data from Booking.com.
Figure 40. Number of offers
Figure 41. Mean and median prices
Figure 42. Histogram of prices
Issues/Exceptions:
As portals advertising accommodation regularly update their webpages, scrapers need to follow their
structure, which can sometimes be a challenging task. Any break in scraping can result in a
discontinuity in the production of statistics based on these data. Statistical offices should make
agreements with these providers or use their API services, which can, however, often be insufficient
in terms of volume and timeliness.
Output summary:
Within the scope of the project, daily scraping of the Hotels.com and Booking.com webpages was
possible with the use of the scrapers prepared by the Polish team. Some issues were encountered
during the period, most of them resolved by the Polish colleagues. Both web pages offered valuable
information for the frame update, flash estimates and temporal disaggregation studies. The stability
of web scraping is an issue in general, as any change to the web service can lead to an interruption.
4.2. Source characteristics
Use Case Identification
Use Case ID WPJ.2.DE-Hesse
Use Case Name Source characteristics
Date of creation 30.09.2020
Use Case Definition
Description: Short description of data sources and characteristics.
Preconditions: None
Flow:
1. Define aims and expected results of using and combining data sources.
2. Identify existing data sources, check underlying definitions and concepts.
3. Check availability of / access to the sources from step 2. Not all of these data necessarily are, or
have to be, held by official statistics.
4. Check the aggregation level of information: micro data vs. aggregated data (e.g. geographical
aggregation: NUTS1, NUTS3, LAU).
5. Check temporal granularity: e.g. data may be available at regular intervals (biennial,
yearly, quarterly or monthly) or at any point in time (e.g. data from web scraping).
Issues/Exceptions:
The time periods covered by existing and available data sources may not overlap, or may overlap only
partly. The temporal aggregation may differ, so temporal disaggregation (and re-aggregation)
may be necessary. Whereas some data sources are up to date, others may become available only after
a larger time lag (weeks, months or even years). This may require gathering data prospectively, long
before it can be used (e.g., it is generally not possible to scrape data from the past in order to compare
it with data sources from the same period that become available only after a specific time lag).
Definitions and concepts of different data sources can differ, or information on methods,
definitions and concepts is sparse or even missing.
Output summary:
Since a description of sources and characteristics was given in a previous deliverable, only general
results are shown here. Figure 43 shows the overall relations between data source types, data sources
and the expected experimental results, based on a visualization tool developed by WPJ (Task 1C).
Figure 43. Overall network of data sources, data source types, expected outcome within WPJ
The overall picture in Figure 43 shows that many partners in principle share the same data sources
(since they originate in EU legislation) and therefore share the same expected experimental results.
Some data sources are not available to all partners but in principle could be made available to other
partners, too (e.g. web scraping of specific sites, such as those with flight and aircraft data). Some
data sources exist in principle but are not (yet) available. Some data sources are rather country
specific (i.e. specific registers or surveys, or websites targeting a specific target population or covering
a specific geographical region).
The main goal is to produce “innovative tourism statistics”, e.g. “flash estimates” and “spatial and
temporal disaggregation of data on the tourist accommodation data base”. Other goals can be
achieved along the way: “improving the tourist accommodation base of reference”, “improving the
quality of tourists’ expense data”, “improving the quality of data on trips” and “improving the quality
of tourist transport data”.
Figure 43 shows that there is no single data source or data source type that is sufficient for producing
these “innovative tourism statistics”. It is the interplay of different data source types and data sources
that allows innovative statistics to be produced. The figure shows that this concerns especially web
data, survey data and multi-purpose data (register or administrative data sources).
Figure 44 shows the schematic picture for Hesse.
Figure 44. Schematic network of data sources, data source types, and expected outcome for Hesse (DE)
In particular, some of the multi-purpose data as well as survey data from Figure 43 are not available for Hesse (since these data source types typically contain rather partner-specific data sources).
Use Case Identification
Use Case ID WPJ.2.IT
Use Case Name Source characteristics
Date of creation 08.09.2020
Use Case Definition
Description: Identification of the potentially informative sources for the construction
of an integrated tourism statistic system.
Preconditions: None
Flow:
Collaborate in the definition of the general structure of the document, suggesting the collection of new
information on each source.
Identify the most popular national and international websites in the tourism sector to choose the best
input for the web scraping procedures.
Make an inventory of data sources related to tourism statistics by selecting those from which data may
be useful to improve the quality and the completeness of official tourism statistics.
Identify, describe, and classify source characteristics taking into account their usefulness for estimating
the demand- and supply-sides of tourism.
Issues/Exceptions:
The real usability of some sources is a critical point. There are periodicity issues: some sources produce only annual data, so seasonal peaks cannot be identified. It is difficult to distinguish between tourism and non-tourism mobility. Furthermore, some of the sources are too sectoral.
Output summary:
The total number of identified sources for Italy was 14. In addition, three international websites were identified (Booking.com, Hotels.com, and Tripadvisor.com). As regards the websites, national ones seem to be less used, so it was decided to focus the initial web scraping on the most popular one (Booking.com).
Table 38. Number of data sources identified in Italy broken down by type
Internal sources: 9
External sources: 5 (2 available; 3 temporarily or permanently not available)
Supply-side: 3   Demand-side: 9   Both sides: 2
Table 39. Data sources identified in Italy (available)
Waste production – frequency: annual; level: NUTS3 (cities); source: external (National Institute for Environmental Protection); long-term source. Potential risk: only annual data, so it is impossible to know the seasonal peaks.
Water consumption – frequency: annual; level: NUTS3 (cities); legislation: national; source: internal; long-term source. Potential risk: only annual data, so it is impossible to know the seasonal peaks.
Table 40. Data sources identified in Italy (available) (cont.)
Airport data – frequency: monthly; additional information: number of passengers (national, international, scheduled, charter service); level: airport; legislation: Regulation EC n. 437/2003 and Regulation EC n. 1358/2003, updated by Regulation (EC) n. 1791/2006; source: internal; long-term source. Potential risk: difficulties in distinguishing between tourism and non-tourism mobility.
Port authority data – frequency: monthly; additional information: number of passengers; level: ports embarking/disembarking at least 200,000 passengers; legislation: Directive 2009/42/EC; source: internal; long-term source. Potential risk: difficulties in distinguishing between tourism and non-tourism mobility.
Railway data – frequency: quarterly; additional information: number of passengers; level: size of railway enterprises; legislation: Regulation EC n. 91/2003; source: internal; long-term source. Potential risk: difficulties in distinguishing between tourism and non-tourism mobility.
Table 41. Data sources identified in Italy (available) (cont.)
Occupancy of tourist accommodation establishments survey – frequency: monthly; additional information: data on arrivals and nights spent (by residents and non-residents) at tourist accommodation establishments in Italy; level: NUTS3 (cities); legislation: Regulation EC (692) on tourism statistics; source: internal; long-term source.
Capacity of collective accommodation establishments survey – frequency: annual; additional information: the main structural information regarding the accommodation establishments; level: NUTS3 (cities); legislation: Regulation EC (692) on tourism statistics; source: internal; long-term source.
Table 42. Data sources identified in Italy (available) (cont.)
Trips and holidays survey – frequency: monthly; additional information: data on participation of residents in trips for personal reasons and business trips (domestic and outbound); provides a set of information on tourist trips, such as destination, booking, the main purpose, type of accommodation, the main means of transport, duration, and period of the year for each trip; level: NUTS2 (regions); legislation: Regulation EC (692) on tourism statistics; source: internal; long-term source.
Table 43. Data sources identified in Italy (available) (cont.)
International tourism survey – frequency: monthly; additional information: resident and non-resident travellers at the Italian borders (road and rail crossings, international ports and airports); level: NUTS2 (regions); source: external (Bank of Italy); long-term source.
Survey on rural tourist accommodations – frequency: annual; additional information: characteristics of farmhouse accommodations; level: NUTS3 (cities); legislation: national; source: internal; long-term source.
Survey on museums and similar institutions – frequency: every four years; additional information: number of admissions; level: NUTS3 (cities); legislation: national; source: internal; long-term source. Potential risk: usable only to estimate this specific part of cultural tourism.
Table 44. Data sources identified in Italy (not yet available)
Mobile phone data – agreements with mobile phone data providers are in progress, but the authorization of the Italian Privacy Authority is still pending. Potential risk: privacy issues.
Traffic images data – experimental statistics for the measurement of traffic flow on the Italian road network.
Data on financial transactions – could become available from Workpackage G.
Use Case Identification
Use Case ID WPJ.2.NL
Use Case Name Source characteristics
Date of creation 09.06.2020. (Last update 25.09.2020)
Design scope (Sub-)system from which most specifications are a “white-box”11.
Use Case Definition
Description Use HTML script (visNet_Workflow_v5.html) to get an overview of the Inputs
and Outputs of the Pilot Track Project on Innovative Tourism statistics, case:
The Netherlands. Analyse and evaluate the I/O Source Characteristics in the
Dutch domain. Optionally, use the R script (visNet_Workflow_v5.R) to edit the HTML output.
Preconditions 1. User has basic knowledge on interactive HTML i.e. user can select filters
for own country.
2. User has basic knowledge on R to load R-libraries and change (read- and
write-) directories.
3. User has installed R (version R-4.0.2) and RStudio with the libraries dplyr (version 1.0.0), visNetwork (version 2.0.9) and rstudioapi, or can download them from the internet.
Flow:
1. Open visNet_Workflow_v5.html on any browser by clicking on file.
2. Use “Select by Group” to choose “Country: NL”.
3. Browse over the nodes of interest. Nodes are classified on “Data Source Type”, “Data
Sources”, “Expected Results” and “External links”.
4. Export the overview as *.png or take a screenshot for reporting.
Issues/Exceptions:
Check that R, RStudio and libraries are properly installed.
Output summary:
The product “An overview of the Inputs and Outputs of the Pilot Project on Innovative Tourism Statistics” puts forward a tool to analyse and evaluate more than seven Flow Models simultaneously. In the current use case, this tool is used to depict the Flow Model of Statistics Netherlands (SN). It is connected to the Catalogue of Sources of WPJ and to Task 1c on “Source Characteristics”, led by Statistics Portugal in cooperation with the Hellenic Statistical Authority and Statistics Netherlands.
11 Three hierarchical categories from Cockburn's design scope are used: (sub)system black-box, (sub)system white-box, and component. [Cockburn (2001), Writing Effective Use Cases. Addison-Wesley. ISBN 0-201-70225-8. OCLC 44046973.]
Figure 45 shows that SN has four expected results (Output) out of the Pilot Track WPJ, namely:
1. Improve Tourism Accommodation Database.
2. Assist spatial disaggregation of accommodations.
3. Support the quality of tourists’ expenses.
4. Build on the quality of satellite accounts.
This figure also shows that SN uses two sorts of data types: Survey and Multi-purpose. Furthermore,
there are 12 current (and potential) data sources that are used or are candidates to be used in the
WPJ. The most important data sources are the Register of non-categorised accommodations (CoC), the
Register of Addresses and Buildings (BAG), the Annual Survey on tourist accommodation base, Survey
on participation of residents in trips (CVO), Survey on trips made by foreigners and the portal
Hotels.com. Finally, nine external links are also shown as relevant sites for further research, including the Dutch Chamber of Commerce, the BAG viewer, Hotels.com and the NBTC dashboard.
Figure 45. Overview made using visNet_Workflow_v5.R
Use Case Identification
Use Case ID WPJ.2.PT
Use Case Name Source characteristics
Date of creation 02.10.2020
Use Case Definition
Description: Variable mapping adds a new layer of information to the previously
developed visNetwork “Overview of the Inputs and Outputs of the Pilot
Project on Innovative Tourism Statistics (v5)”. Based on the R script
previously developed, this new version concerns exclusively Portugal and
the WPJ.4b.PT use case (combining data). This allows a clear and
intuitive view of the process of using variables.
Preconditions: This use case is preceded by use cases WPJ.1.PT (web scraping) and
WPJ.4a.PT (geolocation), the availability of data from the tourist
accommodation survey base, and use case WPJ.4b.PT (combining data).
Basic knowledge of R and the RStudio IDE and familiarity with the R
script visNet_Workflow_v5.R are also advisable.
Flow:
1. Identify variables and tool to add to the visNetwork.
2. Edit edges and nodes accordingly.
3. Edit and run the R Script.
4. Filter data on the html version of the visNetwork.
Issues/Exceptions:
This use case relates only to the specific case of Portugal, although it bears similarities with other countries that also used the same data sources and tools (geolocation tool) for the expected result: improving the survey population of tourist accommodation establishments.
Output summary:
The resulting visNetwork visually maps the direct and indirect connections between data sources, variables used, tools and experimental results, establishing the flow of the process. The output consists of the R code, the HTML file and the images (screen captures). The former has the benefit of being easy to share; the latter two have the benefit of a universal and user-friendly format.
This output is a simple proof of concept illustrating how the visNetwork is a suitable tool to visualize the interconnections between data sources, available variables, tools and expected results.
Figure 46 shows the complete visNetwork for the Portuguese case, where the new nodes (Tool and Variables) are integrated with the previously presented workflow.
Figure 46. Complete visNetwork for Portugal.
Figure 47 shows details of the visNetwork on the variables, data sources, tool and experimental results, combining data from the survey and web data in order to improve the survey population of tourist accommodation establishments. The edges (connections) also identify the method used to link the data: 1-1 matching in the case of the coordinates, and string matching for Name, Address and PostalCode.
Figure 47. Details of the visNetwork for Portugal
Annex 5 (R script to visNetwork object WPJ.2.PT) contains the R script used for this use case.
4.3. Legal aspects
Use Case Identification
Use Case ID WPJ.3.DE-Hesse
Use Case Name Legal aspects
Date of creation 30.09.2020
Use Case Definition
Description: Description of legal aspects
Preconditions: None
Flow:
1. Identify and check relevant general legal requirements and IT security procedures to meet.
2. Consult local legal department and data security and data protection officer.
3. If necessary or optionally, develop a web scraping policy respecting 1) and 2).
4. If necessary, inform relevant positions (e.g. data protection agency), also referring to 3).
5. Implement general or special limitations into scraping software (e.g. user-agent, breaks).
6. Check for reactions of providers. Adapt 5) to reactions from providers.
7. Check for changes in legal requirements. Adapt 5) to changes.
Issues/Exceptions:
Even if web scraping may legally be considered legitimate in general, opinions may differ on whether this mode of data collection should be used (internally, as well as in public opinion).
Only in a few cases may there be a legal obligation for service providers to tolerate scraping by official statistics. If there is no such obligation to provide a service, providers may exclude any user from the service at any time (e.g. by blocking user agents, an IP address or a range of IP addresses). If other users use these IP addresses, they may also be negatively affected.
A strategy for web scraping may include contacting providers (voluntarily or as an obligation) in advance. This may even enable providers to prevent their sites from being scraped in the first place. (Conversely, contacting providers in advance may be what makes scraping possible at all.)
Output summary:
In principle, general web scraping is considered legal since these data are generally available to or
accessible by everyone. However, there are some limitations: the technical process of scraping must
not cause damage to the infrastructure of the website provider or data holder and must not hinder
providers or other visitors to the website in their normal use of the website. Technical barriers must
not be overridden automatically. Some offices for official statistics have developed and adopted a web
scraping policy to make their efforts as transparent as possible to the public as well as to website
owners or providers. This policy may include:
• Specification of a distinct “user agent” in order to be identifiable as coming from and belonging to official statistics (no fake user agent, no “spoofing”). This identification may include a link to a website with more information on the origin of the scraper and the purpose of the scraping.
• No “spoofing” of the IP address and no hiding of the IP origin (rotation, VPN) in order to prevent being blocked.
• Planned breaks of a specific number of seconds between subsequent page requests to the same server to lower the burden on server and network, but no forced imitation of “human interaction” (e.g. no special emphasis on random waiting times).
• Use of specific APIs whenever possible to decrease the server and network burden.
• Contacting website owners or providers in advance to announce the scraping process and its purpose. (In general, there is no need to ask for permission, but contacting providers may result in permission where scraping would otherwise not be possible or allowed.)
(See, e.g., https://statistik.hessen.de/ua or the recommendations of ESSnet Big Data II – WPC.)
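Such a policy can be enforced directly in the scraper. A minimal sketch of the two most mechanical elements (an honest user agent and fixed pauses between requests); the user-agent string, policy URL and delay value here are illustrative assumptions, not the actual configuration of any office:

```python
import time
import urllib.request

# Illustrative values only; a real deployment would use the office's own
# identification string, policy URL and agreed waiting time.
USER_AGENT = "OfficialStatisticsBot/1.0 (+https://example-nsi.eu/scraping-policy)"
REQUEST_DELAY_SECONDS = 5  # planned, fixed break between requests to one server

def build_request(url: str) -> urllib.request.Request:
    """Identify openly as official statistics (no fake user agent, no spoofing)."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def fetch_all(urls, opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch pages sequentially with fixed pauses (no imitation of humans)."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(REQUEST_DELAY_SECONDS)  # lower the burden on server and network
        with opener(build_request(url)) as response:
            pages.append(response.read())
    return pages
```

The `opener` and `sleep` parameters make the politeness logic testable without network access; in production they default to the real `urlopen` and `time.sleep`.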
In Germany, for some official statistics the basic laws explicitly mention web scraping as the mode of data collection (e.g. in price statistics). In such cases, there may even be a legal obligation for website owners or providers to tolerate web scraping by official statistics.
In principle, specific web scraping projects for official statistics should be conducted only after internal consultation with the legal department as well as the data security and protection officer.
There have been mixed experiences regarding contacting website providers in advance. For most of
the websites, there was no reaction at all (either to the scraping or the announcement). In one case,
there was reaction to the scraping process (resulting in a temporarily blocked IP address). In another
case, there was positive reaction to the advance letter (resulting in the possibility of using an API
instead of scraping) even after scraping became negatively noticeable to the provider.
Use Case Identification
Use Case ID WPJ.3.NL
Use Case Name Legal aspects of web scraping: the case of the Netherlands
Date of creation 30.09.2020
Design scope (Sub-)system from which specifications are a “white-box”.12
Use Case Definition
Description Explaining the legal basis applicable in the Netherlands to be able to scrape the portal Hotels.com.
Preconditions The process of web scraping should be transparent, consistent, ethical, and with
respect to all relevant legislation.
Flow:
Web scraping process should cope with the following questions:
1. Is there a legal basis for NSIs to be able to web scrape?
2. If that is the case, which rules must be adhered to?
3. And how should the web scraped data be stored, processed and published?
Issues/Exceptions:
“Information shall only be made public in such a way that no identifiable information about an
individual person, household, enterprise or institution can be derived from it, unless, there is a valid
reason to assume that the undertaking or institution concerned has no objections to the disclosure.”
Output summary:
The use of big data sources, such as web scraping, is not free of rules in the Netherlands. However, these rules are not always crystal clear. They include European law as well as Dutch law. In fact, three questions need to be answered:
1. Is there a legal basis for NSIs to be able to web scrape?
2. If that is the case, which rules must be adhered to?
3. And how should the web scraped data be stored, processed and published?
To start with the first question: generally speaking, web scraping is prohibited by the so-called European Directive 96/9/EC (later the Directive on Copyright in the Digital Single Market), especially if
this goes together with the violation of Intellectual property rights. For example, commercially
republishing of web scraped data or building your own index for pages on someone else's website boils
down to whether the operator of the original site “suffers unjustified harm” in the case substantial
investments were made. However, an exception to this rule is provided for scientific and statistical
research, because there is usually no commercial use or goal. Another legal basis that can be used by
Statistics Netherlands to web scrape is found in the Dutch CBS-law. This law is the foundation for the
work that Statistics Netherlands carries out. Article 3 in particular provides the basis on which Statistics Netherlands is allowed to web scrape. Among other things, this article points to ‘the use for statistical
research by government for the purpose of policy, practice and science’, which should also lead to publication of the results. Finally, one should be aware that if data are web scraped from, say, outside the country’s borders, the rules of the country where the website is hosted apply.
12 Three hierarchical categories from Cockburn's design scope are used: (sub)system black-box, (sub)system white-box, and component. [Cockburn (2001), Writing Effective Use Cases. Addison-Wesley. ISBN 0-201-70225-8. OCLC 44046973.]
In addition to the legal basis, it is also important to know which rules must be adhered to, and when and how one may web scrape data from the internet (second question). It can never be the goal to web scrape unlimited data at any time without any rules. When it comes to the web scraping itself, Statistics
Netherlands has committed itself to the rules described in the ESSnet web scraping policy: ‘the purpose
of this policy is to ensure that web scraping activities of National Statistical Institutes (NSIs) are carried
out transparently, consistently, ethically, and with respect to all relevant legislation; and that web
scraped data are used in an appropriate and ethical manner that limits the burden on website owners
and survey respondents to the greatest extent possible’.13 Important elements of this policy are: follow
the established protocols (scraping bots), be transparent and identify yourself, respect the wishes of
the website owner, minimise the burden on websites and ensure that all web scraped data is securely
stored and processed.
If there is a good legal basis and the data are web scraped according to the proposed rules, it is, finally
(third question), important that the data are stored and processed securely and published correctly.
When publishing web scraped data, as with any other statistical data, one has to adhere at all times to
applicable privacy rules. These privacy rules are mainly regulated in the European General Data
Protection Regulation (GDPR14). The GDPR is designed to give individuals greater control over the ways
in which their personal data is collected, stored, transferred and used. Some principles are:
- Legality: organizations must ensure, among other things, that they have a legal basis to process
personal data, and that they process this data in a fair and transparent manner.
- Limited Use: personal data may only be collected for specific, explicit, legitimate purposes.
- Data minimization: data collection should be limited to only those data that are relevant and
necessary for the intended use.
Besides the protection of personal data (including, for example, that of the self-employed), these principles can be supplemented with Article 37 of the CBS-law, which states15: ‘information shall only be made
public in such a way that no identifiable information about an individual person, household, enterprise
or institution can be derived from it, unless, in the case of information relating to an enterprise or
institution, there is a valid reason to assume that the undertaking or institution concerned has no
objections to the disclosure.’
13 See the draft: https://webgate.ec.europa.eu/fpfis/mwikis/EssNetbigdata/index.php/WPC_EssNet_Webscraping_policy_draft
14 See: https://gdpr-info.eu/
15 See https://wetten.overheid.nl/BWBR0015926/2019-01-01 (Wet op het Centraal bureau voor de statistiek; in Dutch)
4.4. Combining data
Use Case Identification
Use Case ID WPJ.4.EL
Use Case Name Combining data
Date of creation 01.08.2019
Use Case Definition
Description: Exploration of the possibility to use data collected from web portals by
means of web scraping techniques in order to enhance existing tourism
statistics. ELSTAT’s efforts are focused on linking the scraped data to the
survey frame of tourist accommodation statistics (NACE Rev.2 55.1
Hotels and similar establishments) aiming at improving it in terms of
completeness and coverage.
Preconditions: For this case study, the web scraper tool (Hotel_com_v3) developed and
deployed by Statistics Poland under the WPJ has been used. Adaptation
of the web scraper tool to the national IT environment was required,
as well as access to the population frame for hotels and similar establishments
which includes the relevant establishments registered in the Hellenic
Chamber of Hotels (HCH).
Flow:
1. Pre-processing of the Tourist Accommodation Establishments Register.
2. Pre-processing of the records retrieved from Web scraping.
3. Process of data linking.
4. Evaluation of data linkage.
Issues/Exceptions:
The linking process is hampered by several difficulties. Data standardization helps to analyse and use data in a consistent manner. However, such standardization is difficult for addresses: the plethora of street names, the existence of streets with more than one name, as well as misspellings, increase the complexity of the standardization of addresses.
Output summary:
In the framework of the pilot workpackage, daily web scraping on Hotels.com for the period from
8-8-2019 to 19-9-2019 resulted in 1 110 records with unique establishment id for the NUTS 2 Region
of Attica. These records referred to various types of establishments, while 508 of the afore-mentioned
entries referred to types of accommodation that could be considered relevant to the accommodation
type “hotels and similar establishments” as it is defined for statistical purposes. More specifically, guest
houses, villas, apartments, country houses and other similar establishments were excluded in order to
get the list of 508 entries which were further elaborated.
The retrieved establishments were compared to those in the Register of Tourist Accommodation
Establishments, as updated based on the Register of the Hellenic Chamber of Hotels received in
August 2019 (Table 45).
Table 45. Combining web scraped data with the Register of Tourist Accommodation Establishments
Area: Attica
Tourism Register: 876 unique establishments (665 active)
Web scraping Hotels.com: 508 establishments with unique id
Matches: 343
Non-matches: 165
Although inactive hotels are excluded from the survey frame, they were considered for the purposes of this case study, as their status might have changed to active in the meantime, e.g. if they had been temporarily closed or under a change of ownership.
In order to match a tourist accommodation establishment from the web data to the statistical Register, the Fuzzy Lookup Add-In was used. The group of identifiers used for the matching consisted of the postal code, the title of the hotel and its address. During the preparation phase of the scraped data, the “translit” function was used for standardizing the data by converting all Greek characters to Latin ones. A more detailed description and the VBA code are presented in Annex 8 – Process for data linkage.
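The preparation and matching steps can be sketched as follows. This is an illustrative Python re-implementation of the idea only: the actual use case relied on the Excel Fuzzy Lookup Add-In and a VBA “translit” function, and the transliteration table below is a simplified assumption covering only the basic alphabet:

```python
import difflib
import unicodedata

# Simplified Greek-to-Latin table (illustrative; the actual "translit"
# routine covers the full alphabet, digraphs and special cases).
GREEK_TO_LATIN = str.maketrans({
    "α": "a", "β": "v", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "i", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "y", "φ": "f", "χ": "ch", "ψ": "ps",
    "ω": "o",
})

def normalize(text: str) -> str:
    """Lower-case, strip accents, transliterate Greek letters to Latin."""
    text = unicodedata.normalize("NFD", text.lower())
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    return text.translate(GREEK_TO_LATIN)

def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized identifier strings."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def best_match(web_record, register, threshold=0.5):
    """Compare postal code + title + address; flag a match at >= 50%."""
    key = lambda r: " ".join((r["postcode"], r["title"], r["address"]))
    best = max(register, key=lambda r: similarity(key(web_record), key(r)))
    score = similarity(key(web_record), key(best))
    return best, score, ("match" if score >= threshold else "non-match")
```

The 50% threshold mirrors the decision rule applied below; any record scoring under it against every Register entry is treated as a non-match.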
The similarity threshold for related records was set at 50%, i.e. when the result was below 50% it was considered a “non-match”, while at 50% or above it was considered a “match”. Hence, 343 establishments were matched (similarity ranging from 50% to 100%), while 165 establishments were not matched (Table 46).
Table 46. Matched records – results
Similarity range | Records | True Positive | False Positive | Flagged inactive in Register | Cannot be resolved
50%-59%   |  22 |  19 | 2 |  1 | -
60%-69%   |  33 |  29 | 1 |  2 | 1
70%-79%   |  57 |  55 | - |  2 | -
80%-89%   |  54 |  51 | 1 |  2 | -
90%-100%  | 177 | 174 | - |  3 | -
Total     | 343 | 328 | 4 | 10 | 1
The quality of the data linkage process was assessed by manually cross-checking the 343 matched records (title, location, address) to determine whether each was a True Positive or a False Positive match. The results of the evaluation process are summarized in the form of a confusion matrix, which is often used when evaluating the performance of a classification model.
The database of establishments created by data linkage of the web scraped and statistical frame data
is the final output of the methodology for the data linkage process.
It was anticipated that matched establishments with a similarity percentage below 90% would mostly be False Positives. However, it turned out that in most such cases either the address was spelled differently in the two datasets or the title of the establishment in the web portal included the hotel chain name as well, i.e. they were True Positive matches. Moreover, in a few cases the address was
missing in the Register, and in a couple of cases the title of the establishment had changed but the
establishment was still registered under the old title in the Register.
Finally, analysis showed that 328 of the 343 matches are True Positive, while 4 are False Positive,
10 seem to be True Positive, but the matched establishment in the Register is flagged inactive, and
1 case could not be resolved.
Additionally, 25 of the 165 non-matches turned out to be False Negatives, 4 cases could not be resolved, while the remaining 136 cases are presumed to be True Negatives, since no establishment with a similar title or address occurs in the Register.
Table 47. Confusion Matrix (all establishments in portals; 493 resolved cases)
                      Actual match (in web        Actual non-match (in        Total
                      scraping and Tourism        web scraping only)
                      Accommodation Register)
Predicted match       True Positives (TP): 328    False Positives (FP): 4     332
Predicted non-match   False Negatives (FN): 25    True Negatives (TN): 136    161
Total                 353                         140                         493
To sum up, the data linkage of web scraped data for Attica with the relevant hotels in the Register of Tourist Accommodation Establishments resulted in 493 resolved cases out of 508 cases in total (97.0%), while 328 out of the 343 establishments that occur in both the web scraping and the Register were identified (95.6%).
Main conclusions:
• For a significant number of hotels, web scraping can be used to identify the months of operation of the accommodation, which can be used for assessing or estimating the monthly occupancy.
• Web scraping could be used to identify the latest updates in the Register (new establishments) before they are made available in administrative data sources.
• In urban areas, the postal code, the address and the title of the accommodation seem to give information that can be used effectively for matching establishments from web scraping with those in the Register.
• The NACE classification of the establishments in the web scraping file may not be correct. Several accommodations that are registered as hotels on the platform are in fact other types of accommodation establishments according to statistical definitions.
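The confusion matrix counts above (TP = 328, FP = 4, FN = 25, TN = 136) translate directly into the standard quality measures used for evaluating such a linkage; a small illustrative computation:

```python
# Counts from the confusion matrix for the Attica linkage (Table 47).
TP, FP, FN, TN = 328, 4, 25, 136

precision = TP / (TP + FP)                  # share of predicted matches that are correct
recall = TP / (TP + FN)                     # share of actual matches that were found
accuracy = (TP + TN) / (TP + FP + FN + TN)  # share of resolved cases classified correctly

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
```

Precision close to 0.99 confirms that predicted matches are almost always correct, while recall of about 0.93 reflects the 25 False Negatives among the actual matches.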
Use Case Identification
Use Case ID WPJ.4.DE-Hesse
Use Case Name Combining data
Date of creation 30.09.2020
Use Case Definition
Description: A brief description of steps to take for combination of data sources.
Preconditions: None
Flow:
1. Specify goal to achieve with data linkage
a. Identify missing units from one data source
b. Join data sources to enrich information on units present in both data sources
Different goals may require different sources or different matching strategies.
2. Specify costs of misclassifications, i.e. wrong combinations of units from different data
sources. This refers to consequences of false positive combinations or false negative
combinations. Depending on the goal and / or the specific application, costs of misclassification
can be very different.
3. Identify data sources that are available to reach goals from 1). Check definitions and
specifications of the data sources. Specify expected outcome, e.g. overlap of data sources.
4. Identify matching keys needed for combination of data sources from 3). Typically, data sources
at least need some information in common that allow combination on unit level. These may be:
a. unique identifiers (e.g. ID number, tax number, register numbers)
b. non-unique identifiers (names, address information (street name, house number,
postcode, city name).
5. Assess quality of matching keys from 4): do they contain a lot of errors (e.g. typographical
errors, different spellings, missing values, digit transpositions)?
6. Decide on matching strategy, e.g. to combine data sources from 3) using matching keys
from 4): e.g.
a. deterministic matching
b. probabilistic matching.
Matching strategies are not necessarily exclusive. Depending on the specific application,
different strategies may be used jointly or subsequently to combine data sources.
7. After combining data sources, assess quality of the resulting combination manually e.g. on a
subsample. Do results match expectations from 3)? Assess results in the light of 2).
8. Assess the following steps (e.g. further analyses) on robustness for erroneous combinations of
the data sources.
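Steps 6a and 6b above can be sketched as a two-stage linkage: deterministic matching on a unique key first, with a simple similarity-based fallback for the remainder. The field names and threshold are illustrative assumptions, and a production system would use weighted probabilistic matching (e.g. Fellegi-Sunter style) rather than this unweighted mean:

```python
import difflib

def deterministic_match(a: dict, b: dict) -> bool:
    """Step 6a: exact join on a unique identifier (field name illustrative)."""
    return bool(a.get("tax_id")) and a.get("tax_id") == b.get("tax_id")

def similarity_score(a: dict, b: dict) -> float:
    """Step 6b: unweighted mean similarity over non-unique matching keys."""
    fields = ("name", "street", "postcode")
    ratios = [difflib.SequenceMatcher(None,
                                      str(a.get(f, "")).lower(),
                                      str(b.get(f, "")).lower()).ratio()
              for f in fields]
    return sum(ratios) / len(ratios)

def link(records_a, records_b, threshold=0.85):
    """Deterministic matching first, fuzzy fallback for the remainder."""
    links, unmatched = [], list(records_b)
    for a in records_a:
        hit = next((b for b in unmatched if deterministic_match(a, b)), None)
        if hit is None and unmatched:
            candidate = max(unmatched, key=lambda b: similarity_score(a, b))
            if similarity_score(a, candidate) >= threshold:
                hit = candidate
        if hit is not None:
            links.append((a, hit))
            unmatched.remove(hit)
    return links
```

Combining both strategies in one pass illustrates the point from step 6 that the strategies are not exclusive: a poor-quality or missing unique identifier (step 5) simply routes a record to the probabilistic stage.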
Issues/Exceptions:
Data sources may not be available or accessible. There may be legal obstacles to combining data at all or to using the matching keys. There may be no common matching keys, or the matching keys may be of poor quality. Combining data may produce too many false combinations, or even a few misclassifications may be too costly. Combining data may introduce systematic linkage bias when units correctly combined (true positives) differ systematically from units that cannot be combined (false negatives).
Output summary: See deliverable J3 which includes the case study from DE-Hesse.
Use Case Identification
Use Case ID WPJ.4.IT
Use Case Name Combining data
Date of creation 10.09.2020
Use Case Definition
Description: Examining the possibility of matching different sources of data: web
scraping and administrative data to complete and enrich information on
accommodation establishments.
Preconditions: IT programmer with knowledge of matching instruments and functions.
Flow:
1. The Directorate for environmental and territorial statistics of ISTAT provided the register of the accommodation establishments of the Emilia-Romagna Region. This regional archive was identified as one of the most suitable for the purpose, given its good quality in terms of exhaustiveness and completeness of the information, the full coincidence between the register and the results of the Capacity survey, its territorial coverage, and the good responsiveness of the regional contacts. Furthermore, Emilia-Romagna is a relevant area of the country in terms of tourism (for both establishments and tourist flows). The Region has provided this archive to ISTAT only since 2018, as a consequence of the updating of the national statistical programme.
2. Since this was the first time this archive was available, some issues needed to be clarified, for instance the difference between the number of records in the register and the number of accommodation establishments: for rental accommodation managed as enterprises, the number of establishments per record is greater than 1, so a single record in the register (one management of several lodgings) refers to several accommodation establishments in the area. Moreover, the classification of the "open-air" accommodations and the difference between the three typologies (campsites, tourist villages, and campsites and tourist villages in mixed form) was clarified. There are 25 683 establishments in the regional register, of which 9 733 are of type C2 "Other private accommodation establishments" (not disseminated in the ISTAT data warehouse). This leaves 15 950 accommodation establishments in the register, referring to 11 526 distinct records, which represent 7.4% of Italian structures.
3. Some preparatory operations were carried out for the standardization of the text strings, on both
datasets, especially on the name of the hotel, to make them comparable. In particular, isolation
of the name of the structure from the classification attributes (hotel, B&B, others); the breakdown
of the address to the highest possible level of detail (the street, street number, postal code, and
municipality); the elimination of duplicates (the same string of name-province, city, address). The
city and the postal code were extrapolated with a regular expression from the formatted address,
obtained with ArcGIS API. Two different workflows were carried out starting from here, as
described in the following points 4-9 and 10-13.
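The regular-expression extraction of city and postal code described in step 3 can be sketched as follows. The assumed layout of the formatted address ("street houseno, 5-digit CAP city, province") and the function name are illustrative assumptions, not the actual ISTAT code:

```python
import re

# Assumed formatted-address layout: "street houseno, 5-digit CAP city, province"
ADDR_RE = re.compile(r"\b(\d{5})\s+([A-Za-zÀ-ÿ' ]+?)\s*(?:,|$)")

def extract_cap_city(formatted_address):
    """Extract the 5-digit Italian postal code (CAP) and the city name."""
    m = ADDR_RE.search(formatted_address)
    if not m:
        return None, None
    return m.group(1), m.group(2).strip()

cap, city = extract_cap_city("Via Roma 12, 40121 Bologna, BO")
print(cap, city)  # 40121 Bologna
```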
4. The HERE Maps geographic tool was used for geolocation (only for the scraped structures). The input expects the following fields: UNIT-ID, NAME, CITY, ADDRESS and POSTAL CODE. The most relevant output fields of HERE Maps are POSTAL CODE, STATE, STREET, HOUSENUMBER, LATITUDE and LONGITUDE, each returned as a separate field. HERE Maps geolocation was successful for 86.6% of the structures in the web scraping archive, the remaining cases being False Negatives (most of them null results).
5. Having more detail in separate fields makes it possible to diversify the types of combination and, in subsequent steps, to loosen the matching conditions in order to obtain a larger number of matches.
6. A unique code was added to the administrative data archive because the fiscal code is sometimes duplicated (for units belonging to the same owner) and in other cases absent. The same process was performed on the scraped data.
7. The first comparison was an exact match between two strings on hotel name and postal code. For
them, a matching score of 100% was considered.
8. In the next comparison attempt, the Jaro-Winkler similarity, a measure of similarity between two strings, was tried. The Jaro measure is the weighted sum of the percentage of matched characters from each file and of transposed characters. Winkler increased this measure for agreement on the initial characters and then rescaled it by a piecewise function whose intervals and weights depend on the type of string (first name, last name, street, and others).
9. Dozens of tests were carried out, without yet finding the optimal matching score. The next steps will be:
- repeating the process on the most up-to-date archives;
- using ArcGIS software on both sources to format the addresses and extract address, municipality, postcode and province into separate fields (as with HERE Maps);
- a new matching round with Jaro-Winkler, storing the results with different scores.
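The basic Jaro-Winkler measure used in step 8 can be sketched as follows. This is the standard variant with a common-prefix bonus, without the type-dependent piecewise rescaling mentioned above; a minimal illustration, not the code used in the case study:

```python
def jaro(s1, s2):
    """Jaro similarity: matched characters within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0  # count transpositions among matched characters
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Boost the Jaro score by the length of the common prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        l += 1
    return j + l * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```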
Issues/Exceptions:
- The HERE Maps tool was also tried to obtain the location of the structures in the regional archive, but the results were not satisfactory: 40% of cases were False Negatives on the province, most of them null results.
- Finding the best matching score: if it is too high, i.e. more restrictive, the risk is a loss of information, but if it is too low, wrong records may be matched.
Output summary:
Table 48. Results of matching on 2018/2019 Emilia-Romagna administrative archive (HERE Maps geo-localization of structures scraped on Booking.com)

Feature | Emilia-Romagna Archive | Booking.com on Emilia-Romagna
Reference period | 2018 | 19.11.2019/03.12.2019
Survey Population | 11 526 structures on 328 municipalities | 5 670 on 315 municipalities (*only 4 109 mapped with HERE Maps)
1-1 matching (Linkage Hotel name and postal code) | 607 (5.2%)(a)

1-1 matching (Linkage Hotel name, HERE Maps address with Jaro-Winkler):
Score Jaro-Winkler Hotel name | Score Jaro-Winkler Address | 1-1 matching | %(a)
80 | 80 | 1 025 | 8.9
85 | 85 | 242 | 2.1

(a) On 11 526 records
1. ArcGIS was used for the geolocation of structures in both sources. The software returns the formatted address, the latitude-longitude coordinates (based on it), and a score for the calculated coordinates, which is useful for selecting only the coordinates with the highest score, considered the most reliable.
2. Several linkage attempts were made: by hotel name and municipality; by formatted hotel name (upper case, removal of spaces, removal of words classifying the structure, such as hotel, room, camping, farmhouse) and municipality; by formatted address (for multiple candidates it matches the structure name with a tolerance of 1 character); by latitude and longitude coordinates (here too, if there are many structures it matches the structure name with a tolerance of 1). In the latter attempt, several tolerance distances were tried (10, 20 and 30 metres). Every attempt was numbered and stored in a separate column of the database so as to trace it and identify the successful ones for each structure.
3. Evaluation of the candidate score for the address match: it ranges from 0 to 100, where 100 indicates a perfect match.
4. Check of the matches: some territories were not well covered (under-coverage) by the web scraping results, so a geographical analysis of the areas covered by web scraping was carried out with GIS software. Moreover, the types of accommodation included in and excluded from web scraping were identified in order to detect the unsatisfactory matches. As a result, three lists of municipalities were defined to allow the IT experts to refine the web scraping criteria. The first list contains municipalities with low matching coverage, namely with the greatest distances between the regional register (higher values) and the web scraping results (41 municipalities); it includes provincial capitals (Parma, Piacenza, Reggio Emilia, Modena, Bologna, Ferrara, Ravenna, Rimini) and municipalities with high accommodation capacity (Salsomaggiore Terme, Comacchio, Cervia, Cesenatico, Gatteo, Cattolica, Riccione). The second list contains the municipalities absent from the web scraping output although they have establishments in the regional register (165 municipalities); it also includes municipalities with many establishments, for instance Misano Adriatico with 250 units. The third list contains the municipalities present in web scraping but not in the regional register.
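Attempt no. 2 above (formatted hotel name with a tolerance of 1 character) can be sketched as follows. The classifier-word list and function names are hypothetical:

```python
CLASSIFIER_WORDS = {"HOTEL", "ROOM", "CAMPING", "FARMHOUSE", "B&B"}  # illustrative list

def format_name(name):
    """Upper-case, drop classifier words and spaces (as in attempt no. 2)."""
    tokens = [t for t in name.upper().split() if t not in CLASSIFIER_WORDS]
    return "".join(tokens)

def levenshtein(a, b):
    """Plain edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def names_match(n1, n2, tolerance=1):
    """Match two structure names after formatting, within an edit tolerance."""
    return levenshtein(format_name(n1), format_name(n2)) <= tolerance

print(names_match("Hotel Bellaria", "BELARIA"))  # True (one deletion away)
```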
Issues/Exceptions:
- Cleaning, standardising and formatting address data may be very complex and time-consuming.
- The choice of the optimal set of information for carrying out the match (matching type) can be difficult.
- Finding the best matching score: if it is too high, i.e. more restrictive, the risk is a loss of information, but if it is too low, wrong records would be matched. In particular, some cases with a less precise address, e.g. countryside areas where farmhouses do not have a house number, need more approximation during the match.
- Considering the attempts that maximise the number of matched structures (4 270, referring to the year 2019) gives a 1-1 matching percentage of about 33%. Further analysis is needed to count the False Negative and False Positive matches.
- Data from other portals (international or national) are needed to understand the share of structures present on the web compared with the total.
Output summary:
Table 49. Matchings attempts on 2018/2019 Emilia-Romagna administrative archive
Match
No. Matching type
Score
ArcGis
Number of matched
structures
2018 2019
1 Linkage Hotel name and municipality - 648 1 265
2 Linkage Hotel name formatted and municipality - 1 738 2 715
3 Linkage formatted address 90 2 665 3 611
80 2 968 3 698
70 3 001 3 699
4 Linkage Geo Latitude, Longitude distance
10 meters
90 2 927 3 711
80 2 979 3 819
70 3 004 3 820
5 Linkage Geo Latitude, Longitude distance
20 meters
90 - 3 922
80 - 4 029
70 - 4 030
6 Linkage Geo Latitude, Longitude distance
30 meters
90 - 4 156
80 - 4 269
70 - 4 270
Table 50. Results of matching on 2018/2019 Emilia-Romagna administrative archive (ArcGIS geo-localization of both regional archive and structures scraped on Booking.com)

Feature | Emilia-Romagna Archive | Booking.com on Emilia-Romagna
Reference period | 2018 | 19.11.2019/03.12.2019
Survey Population | 11 526 structures on 328 municipalities | 5 670 on 315 municipalities
1-1 matching | 3 004 (26.0%)(a)
Reference period | 2019 | 29.05.2020
Survey Population | 12 877 structures on 328 municipalities | 6 419 on 315 municipalities
1-1 matching | 4 270 (33.1%)(b)

(a) On 11 526 records
(b) On 12 877 records
Expected results:
- Define the degree of coverage of the Region's registers and, indirectly, of the Capacity survey (identifying, for instance, accommodation establishments present only on the web and not in the register). This process will be useful for hotels but, above all, for the emerging sector of "other collective accommodation", in particular "holiday dwellings" (the so-called accommodation managed as enterprises, a typology less clearly delineated than the others) and "private accommodations", in order to produce statistics on these missing establishments in terms of type, location and other attributes.
- Analyse the quality of the information on accommodation establishments provided by the Regions; integrate the official statistics (namely those already required by the European Regulation on tourism statistics/PSN) and collect new variables present on the web (for instance, establishment characteristics and geographic coordinates); produce statistics on the linked establishments.
Use Case Identification
Use Case ID WPJ.4.NL
Use case Name Combining data (Geo tool)
Date of creation 25.11.2019. (Last update 04.12.2019)
Design scope Component, for which most specifications are a "black-box" 16.
Use Case Definition
Description Using script geo2.html to convert (map forward and backward) addresses of
accommodations into location data (latitude and longitude coordinates) and to
save the output into (ouputfileWS.csv) to match with Dutch Accommodation
Enterprises Database (based on Hotels.com) over time.
Preconditions 1. User needs some basic knowledge of Excel and CSV files to prepare the input
file Establishment by type of accommodation.csv.
2. The developer of the geolocation tool requires knowledge of the HERE database.
3. The script requires the Chrome driver/browser.
4. The script does not require any knowledge of HTML, CSS or Java.
Flow:
1. Click on geo2.html.
2. Check that browser starts.
3. Choose file Establishment by type of accommodation.csv or drop spreadsheet into square
drawn in html. (Message “Starting geolocation for address” will appear).
4. Wait until script runs over the list of addresses provided in the CSV file. (Message: “Finished
processing all addresses”).
5. The location tool will plot green tags marking the found locations.
6. Check if ouputfileWS.csv is written in hard-disk.
7. Deploy CSV file along with a copy of the geo2.html used during geolocation.
Issues/Exceptions:
1. The geolocation process does not start: check whether the Chrome driver is properly installed.
2. The message "Starting geolocation for address" remains on for a long time without the CSV
file being saved. (Depending on the length of the CSV, the process can easily take half an hour
for two or three thousand records.)
3. Check whether the geolocation tool output is plotted, i.e. green symbols mark the found
geolocations.
4. If the address or postal code is (partially) incomplete, the database delivers outcomes that
have to be checked.
5. If the address or postal code is empty, the result is an empty cell.
16 Three hierarchical categories from Cockburn's Design scope are used: (sub)system black-box, (sub)system white-box and component. [(Cockburn, 2001) Writing Effective Use Cases. Addison-Wesley. ISBN 0-201-70225-8. OCLC 44046973]
Output summary:
The geolocation tool developed by Statistics Poland is a very robust tool for finding the locations of large sets of addresses. It was used here in the context of tourism statistics, but it can easily be applied for other purposes, e.g. museum or sport statistics, and more generally to provide geolocations for enterprise data (e.g. from the Chamber of Commerce). The steps to upload and download data in the geolocation tool are easy and are depicted in Figure 48.
Finally, it was found that the geolocations provided by the geolocation tool using the HERE database are proxies of the data available in the Register of Addresses and Buildings of the Netherlands (BAG). The BAG provides the geolocation of the building, whereas the geolocation tool (and Hotels.com) provides the coordinates of the (adjacent) street, as reported in deliverable J3 of WPJ.
Figure 48. Upload (step 1) and download (step 2) of address information in the geolocation tool
Use Case Identification
Use Case ID WPJ.4.PL
Use Case Name Combining data
Date of creation 01.09.2020
Use Case Definition
Description: Methods of combining data from surveys of the occupancy of
accommodation establishments and web scraping of accommodation
portals based on address data and geographical coordinates.
Preconditions: Programming skills in R language and knowledge of text mining
techniques.
Flow:
1. Preparation of two separate databases of accommodation establishments containing address
variables (postal code, city, street, apartment number, etc.) on the basis of the database of
accommodation establishments obtained from the accommodation portal and the database
of the survey population of the tourist accommodation establishments.
2. Assigning geographic coordinates using the HERE Maps API tool for the records from both
databases.
In a situation where it is not possible to obtain geographic coordinates for establishments from
the accommodation portal using the above-mentioned tool (due to an incorrect address or its
lack), the coordinates given in the offer on this portal are assigned - if available.
3. Determining the value of the critical distance - the maximum distance that will be a
determinant for the pairing of establishments.
4. Calculation of the distance matrix using the Haversine or Vincenty formula on the basis of the previously obtained geographic coordinates.
5. Pairing the establishments in accordance with the principle that each establishment from the
accommodation portal is assigned to the closest establishment from the statistical surveys,
which is also below the established critical distance.
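Steps 3-5 can be sketched as follows, assuming a simple nearest-neighbour search under a critical distance. This is a minimal illustration in Python (the case study itself used R), with hypothetical identifiers and coordinates:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (Haversine formula, mean Earth radius)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def pair_establishments(portal, survey, critical_m):
    """Assign each portal establishment to the closest survey establishment
    lying within the critical distance (steps 3 and 5 of the flow)."""
    pairs = {}
    for pid, (plat, plon) in portal.items():
        best, best_d = None, critical_m
        for sid, (slat, slon) in survey.items():
            d = haversine_m(plat, plon, slat, slon)
            if d <= best_d:
                best, best_d = sid, d
        if best is not None:
            pairs[pid] = best
    return pairs

# Hypothetical coordinates: the first survey unit lies ~22 m from the portal unit
portal = {"p1": (52.2297, 21.0122)}
survey = {"s1": (52.2299, 21.0122), "s2": (52.2500, 21.0500)}
print(pair_establishments(portal, survey, critical_m=100))  # {'p1': 's1'}
```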
Issues/Exceptions:
Combining establishments from the population frame of the survey of tourist accommodation establishments with establishments obtained by web scraping accommodation portals, based solely on geographic coordinates and distances, is characterised by a high percentage of incorrectly paired establishments among all paired ones. It is therefore advisable to use, in addition to geographical proximity, other available information as an additional matching criterion.
Output summary:
To combine data on accommodation establishments, population frame of the survey of tourist
accommodation establishments held by official statistics and data obtained from web scraping of the
Hotels.com portal were used. The population frame contained information on 16 407 establishments,
while the web scraping data set included 3 742 accommodation establishments. The geographical
coordinates for both datasets were generated using the HERE MAPS API tool, although for most of the
establishments from the second set they were available directly on the Hotels.com portal. This
approach ensured adequate comparability of the resulting coordinates. The pairing was made on the
basis of Vincenty's formula. As a result, 1 719 establishments were paired. On the basis of the error matrix, a number of indicators were determined. Their definitions and symbols are given in Annex 1 – Quality indicators of data matching. 94% of the establishments were correctly paired (TP = 774) or correctly unpaired (TN = 16 561). Such high accuracy results from the very high value of the second component (TN). 2 023 establishments obtained from the Hotels.com portal and 14 688 establishments from the survey population were not paired at all.
Table 51. Error matrix for pairing of accommodation establishments

                       | Reality: Pairing | Reality: No pairing
Prediction: Pairing    | TP = 774         | FP = 945
Prediction: No pairing | FN = 151         | TN = 16 561
In a situation where the population is unbalanced, other indicators should be used to analyse TP and TN separately. Additionally, the matching method should have high sensitivity and high precision: then a paired establishment will, with high frequency, be paired correctly, and an unpaired establishment will be correctly unpaired.
Subsequently, the remaining indicators were determined. The sensitivity index is high, at 0.836, which means that 83.6% of the establishments that should be paired were paired correctly. On the other hand, the precision ratio did not reach a high value, at 0.45, which means that of the 1 719 paired establishments only 45% were paired correctly. The measure that connects these two indices (sensitivity and precision) is F1, their harmonic mean. F1 was 0.585, which can be considered a rather low score. The F1 indicator focuses on paired establishments and carries no information about properly unpaired establishments. To take these into account, the Youden index, calculated as the sum of the sensitivity and specificity indices minus one, is used. The Youden index was 0.782, which is quite a high score. The negative prediction value was 0.991, which means that correctly unpaired establishments formed a very large group among all unpaired establishments. Moreover, the positive and negative probability factors (likelihood ratios) and the diagnostic odds ratio were calculated, which were respectively 15.5 (the quotient of TPR and FPR), 0.173 (the quotient of FNR and TNR) and 89.9 (the quotient of the positive and negative probability factors).
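All of the quoted indicators follow directly from the error matrix in Table 51. The sketch below recomputes them; differences in the last digit with respect to the reported figures are due to rounding:

```python
TP, FP, FN, TN = 774, 945, 151, 16561  # error matrix from Table 51

accuracy    = (TP + TN) / (TP + FP + FN + TN)
sensitivity = TP / (TP + FN)                   # true positive rate (TPR)
specificity = TN / (TN + FP)                   # true negative rate (TNR)
precision   = TP / (TP + FP)
f1          = 2 * precision * sensitivity / (precision + sensitivity)
youden      = sensitivity + specificity - 1
npv         = TN / (TN + FN)                   # negative prediction value
lr_pos      = sensitivity / (FP / (FP + TN))   # TPR / FPR
lr_neg      = (FN / (TP + FN)) / specificity   # FNR / TNR
dor         = lr_pos / lr_neg                  # diagnostic odds ratio

print(round(accuracy, 2), round(sensitivity, 3), round(precision, 2),
      round(f1, 3), round(youden, 3), round(npv, 3), round(lr_pos, 1),
      round(lr_neg, 3), round(dor, 1))
# 0.94 0.837 0.45 0.585 0.783 0.991 15.5 0.173 89.8
```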
The experience gained allows us to conclude that pairing establishments on the basis of geographic coordinates is a very quick method that does not require data standardisation, but it has some imperfections that one should be aware of. Additional information about the establishments, e.g. their names, should be used as an additional pairing criterion.
Use Case Identification
Use Case ID WPJ.4a.PT
Use Case Name Combining data (geolocation)
Date of creation 01.10.2020
Use Case Definition
Description: Geo locating data from Hotels.com portal and official statistics survey
adds a new important layer of information that can be used, for example,
to link and combine both data sources. Since a key variable does not exist
to link both of them, using a tool to generate coordinates could provide
a viable alternative. Geolocation tool uses address data to attribute
coordinates (latitude and longitude) and that data can be found in both
data sources.
Preconditions: To use the geolocation tool it is necessary to have the output files from
web scraping for the Hotels.com portal. This file does not require any
modification. As for the survey file, it must comply with a simple
structure: variables have to be renamed with specific designations:
unitid, name, city, address and postalCode. Chrome web browser must
be installed.
Flow:
The geolocation tool has a very user-friendly interface and allows drag and drop of one file at a time. Once the process is complete, the "generate output file" button becomes enabled and clicking it produces a CSV file with the results. The process is identical for files from both data sources.
Issues/Exceptions:
Geocoding the Hotels.com data ran without any issues, but there was a problem with the survey data (XLSX format): the geolocation tool would crash after some time due to an out-of-memory error. Although the file had a considerable size (1.97 MB; 9 341 cases and 20 variables), neither its size nor the geocoding tool itself appeared to be the cause of the crash.
The original file was split into multiple files, and the geolocation tool would work on some and crash on others. The solution found to overcome this issue was to rebuild the survey data file with synthetic data so that it became identical in structure and format to the data from the web scraping process (CSV format). In this synthetic file the only real data were the Id and the address provided by the survey data file. It was then processed without any issues by the geolocation tool.
Output summary:
The geolocation tool provided latitude and longitude for 63.5% (n=7 743) of the scraped file. For the remaining cases with missing coordinates, OpenStreetMap Nominatim was used to complement that information. This approach geolocated only about 20% of the missing data. Since these results were not encouraging and the Nominatim usage policy is somewhat restrictive, this approach was abandoned and its results discarded for the present use case.
As for the survey data file, the geolocation tool provided coordinates for 86.3% (n=8 092). The difference between the percentages of coordinates attributed in the two data sources most likely relates to the quality of the data itself.
Use Case Identification
Use Case ID WPJ.4b.PT
Use Case Name Combining data
Date of creation 01.10.2020
Use Case Definition
Description: Scraped data from Hotels.com portal has a high potential to improve
survey accommodation base used for official statistics. Available data for
matching is already present in both data sources, namely name of
accommodations, address and postal code. Additional data (coordinates)
was added using geo tool. Methods chosen to link data were matching
1-1 and string distance matching.
Preconditions: This use case is preceded by use case WPJ.1.PT (web scraping) and use
case WPJ.4a.PT (geolocation) as well the availability of a survey data base
on tourist accommodations.
Flow:
The flow for this use case comprises prioritising the reference data, linking the data, and matching by string distance and by coordinates.
1. Prioritize reference data
Data from Hotels.com was collected for the 308 municipalities found in Portugal mainland and Madeira
and Azores Archipelagos. These data collection criteria returned an extensive amount of data. In order
to better control the results of combining data from web scraping and survey data, both data sources
were prioritized by common criteria of accommodation type and territorial scope.
Hotels.com data was filtered using accommodation type: “Hotel”, “Apart-Hotel” and “Pousada”.
Survey data was also filtered with equivalent criteria. For this purpose, CAE-Rev.3 classification
(national classification for economic activities) that is NACE-Rev.2 equivalent up to four digits with an
added fifth digit for increased detail and adaptability to the national context, was used. The selected
accommodations were classified by the following CAE-Rev3 codes:
55111 Hotel with restaurant
55121 Hotel without restaurant
55114 Pousadas
55116 Apart-Hotel with restaurant
As for territorial scope, data concerns only Portugal Mainland (both Madeira and Azores Archipelagos
being excluded from the present case).
As a result, web scraped data from Hotels.com was reduced to 1 856 distinct hotelId values and survey data to 1 474 accommodations. These subsets of the data sources will be referred to as reference data, as opposed to the complete data.
2. Linking web scraped data to survey data by Name and Postal Code and Locality
Approximate string matching of data from Hotels.com and survey data was preceded by a basic text-mining approach. Accommodation names and addresses were standardised and a curated stop-words list was used. In an additional step, addresses were also standardised with a special focus on place names and common abbreviations. The importance of data cleaning and standardisation for a task such as combining data sources cannot be stressed enough. Naturally, the same procedures were applied to both data sources.
For string matching, a string distance was used that counts the number of operations (substitution, deletion, insertion or transposition of characters) needed to turn one string into another. In this case the Optimal String Alignment (OSA) distance was chosen, which is similar to the Levenshtein distance but also allows transposition of adjacent characters; each substring may be edited only once. This method is available in Mark P.J. van der Loo's stringdist R package.
Matching by string distance was then performed by concatenating accommodation name, postal code
and locality. Duplicated strings were removed if any resulted from this concatenation. Results up to
a string distance of 3 OSA are presented in Table 52.
Table 52. Match by name, postal code and locality (OSA <= 3)
osa n percent
0 458 72.2
1 70 11.0
2 52 8.2
3 54 8.5
Total 634 100.0
Exact matching (OSA = 0) accounts for 458 cases, which is about 25% of the 1 856 cases available in the Hotels.com reference data. Matching by string distance added another 176 cases. All 634 cases were then manually validated by cross-referencing the complete available address and geolocation data. This validation confirmed that all 634 cases were True Positives (TP).
In conclusion, string distance linkage from Hotels.com data to survey data, using Name and complete Postal Code and Locality with a threshold of up to 3 OSA, allows a 34.2% match with a precision17 of 1.
3. Linking web scraped data to survey data by Name, Postal Code (4 digits) and Locality
Postal codes are updated with some regularity and new ones are created, hence there are examples
of outdated postal codes that were not updated in some sources (Hotels.com, e.g.). Complete postal
code is composed by a series of 4 plus 3 digits separated by a hyphen (0000-000) but it is not rare to
find only the first 4 digits. Furthermore most of the changes are made to the last 3 digits so data linkage
using complete postal code (7 digits) will always be challenging.
17 Precision = TP / (TP + FP)
There are obviously different protocols for registering addresses in the two data sources, but what is more problematic is that there is no guaranteed consistent protocol within the web scraped data. In Hotels.com's postalCode column, about 1% of cases were found to use a space character around the hyphen that separates the first 4 digits from the remaining 3, and 8% did not have any hyphen. Only about 9% (n=1 684) of postal codes from Hotels.com had a length of 8 characters, which is the expected length for a complete postal code.
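A normalisation step for the postal-code inconsistencies just described (spaces around the hyphen, missing hyphen, 4-digit-only codes) could look as follows; the function name and fallback behaviour are illustrative assumptions:

```python
import re

def normalise_postal_code(raw):
    """Normalise a Portuguese postal code to 'DDDD-DDD' when possible,
    falling back to the first four digits ('DDDD') otherwise."""
    digits = re.sub(r"\D", "", str(raw))  # drop spaces, hyphens, other noise
    if len(digits) >= 7:
        return f"{digits[:4]}-{digits[4:7]}"
    if len(digits) >= 4:
        return digits[:4]
    return None

print(normalise_postal_code("1250 - 096"))  # 1250-096
print(normalise_postal_code("1250096"))     # 1250-096
print(normalise_postal_code("1250"))        # 1250
```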
The added benefit of this step is that cases where the last three postal code digits are different (possibly updated) or missing in one of the data sources can still be extracted. The results can be found in Table 53.
Table 53. Match by name, postal code (4 digits) and locality (OSA <= 3)
osa n percent
0 556 83.6
1 45 6.8
2 42 6.3
3 22 3.3
Total 665 100.0
The first relevant fact is that the number of cases with exact matching (OSA = 0) is higher than in the previous step: discarding the last three digits of the postal code allowed an increase of 98 direct matching cases. On the other hand, after manually validating the results, it was concluded that the added benefit of using only the postal code's first four digits is quite minimal, as can be confirmed in Table 54.
Table 54. Results for TP / FP (OSA <= 3)
val n percent
FP 13 2.0
TP 652 98.0
Total 665 100.0
Not surprisingly, False Positives increase with string distance even though they do not surpass the
number of True Positives.
Table 55. Validated results by string distance (OSA <= 3)
osa FP TP Total
0 0 556 556
1 0 45 45
2 2 40 42
3 11 11 22
Total 13 652 665
Below are some examples of strings from Hotels.com and survey data that are False Positives with an OSA string distance of 3:
“soria 1250 lisboa” VS “florida 1250 lisboa”
“patria 1050 lisboa” VS “italia 1050 lisboa”
“flor baixa 1100 lisboa” VS “lis baixa 1100 lisboa”
In conclusion, string distance linkage from Hotels.com data to survey data, with a threshold of up to
3 OSA allows a 35.1% match with a Precision of 0.98.
4. Linking web scraped data to survey data by Latitude and Longitude
Geocoding data (latitude and longitude) for Hotels.com and survey data was provided by the geolocation tool developed by the Polish team, described in use case WPJ.4a.PT. The process of attributing coordinates uses address information, so better results are expected when the protocol used to register addresses is stricter. Additionally, hotel accommodations such as those in this use case's reference data are more likely to have complete and accurate addresses than, for example, tourist apartments.
The geolocation tool provided coordinates for 90.2% of cases (n=1 329) out of a total of 1 474 in the survey data. The figure for the Hotels.com data is higher: the geolocation tool provided coordinates for 96.2% of cases (n=1 785) out of a total of 1 856. However, these include duplicate values, that is, duplicate coordinates for different accommodations. Once again, this would be expected for tourist apartments or bed & breakfast accommodations (one or more can share the same building), but not so much for hotel-type tourist accommodations. After removing these duplicate coordinates, 84.7% (n=1 248) unique coordinates remained in the survey data and 85.2% (n=1 581) in the Hotels.com data.
Following this data cleaning process and using latitude and longitude, it was possible to directly match 635 accommodations between Hotels.com and survey data. These results were manually validated by cross-referencing name and address: 8 cases turned out to be False Positives.
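The duplicate-coordinate removal and direct match described above can be sketched as follows (hypothetical record layout; exact coordinate equality is assumed, as in the use case):

```python
from collections import Counter

def unique_coordinates(records):
    """Keep only records whose (lat, lon) pair occurs exactly once, since
    duplicate coordinates cannot identify a single establishment."""
    counts = Counter((r["lat"], r["lon"]) for r in records)
    return {(r["lat"], r["lon"]): r["id"] for r in records
            if counts[(r["lat"], r["lon"])] == 1}

def direct_match(scraped, survey):
    """1-1 match on identical coordinates after duplicate removal."""
    a, b = unique_coordinates(scraped), unique_coordinates(survey)
    return [(a[c], b[c]) for c in a.keys() & b.keys()]

scraped = [{"id": "h1", "lat": 38.71, "lon": -9.14},
           {"id": "h2", "lat": 38.72, "lon": -9.13},
           {"id": "h3", "lat": 38.72, "lon": -9.13}]  # duplicate pair is dropped
survey  = [{"id": "s1", "lat": 38.71, "lon": -9.14}]
print(direct_match(scraped, survey))  # [('h1', 's1')]
```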
Table 56. Linking web data to survey data by Latitude and Longitude: Results for TP/FP
val n percent
FP 8 1.3
TP 627 98.7
In conclusion, direct match from Hotels.com data to survey data with coordinates allows a 33.8%
match with a Precision of 0.99.
Issues/Exceptions:
Duplicate strings resulting from data cleaning or from concatenating the values to be matched (e.g. name + postal code + locality) are not handled by this method.
A significant amount of data is still to be classified as False Positive or False Negative.
Output summary:
Approximate string matching of web scraped data to survey data, using the accommodation name, postal
code (either with 7 or 4 digits) and locality within an Optimal String Alignment (OSA) distance of up to 3,
proved to be a solid method for combining data.
Direct matching with the coordinates provided by the geolocation tool yields different but equivalent results in
terms of the number of matches.
Both methods for combining data match over 30% of the Hotels.com accommodations with
very high Precision values. Comparing the two methods, one finds that 386
accommodations match both by string distance and by coordinates.
Table 57. Comparing results from string distance and coordinates
                          Direct match by coordinates
Match by string distance  FP   TP    NA      Total
FP                         1    2    10         13
TP                         1  386   265        652
NA                         6  239   946      1 191
Total                      8  627   1 221    1 856
When combining both approaches it is possible to match 890 accommodations, which represent 48%
of the Hotels.com cases. Although this is not a very high matching rate, the approach guarantees a low
number of False Positives and therefore a very high Precision value.
Use Case Identification
Use Case ID WPJ.4.SK
Use Case Name Combining data
Date of creation 06.10.2020
Use Case Definition
Description: Use of web scraped data for the purpose of survey frame update.
Preconditions: At least one daily file of web scraped data from one or more portals. In the
case of a longer series of scraped data, a file with unique accommodation
establishments would be needed.
Flow:
The Hotels.com and Booking.com reservation portals were scraped, as scraping programmes for both were
available from the Polish team. Despite the lower popularity of Hotels.com in Slovakia, the
Booking.com portal is a clear leader and can thus provide the most exhaustive information for a possible
survey frame update. There are also other booking portals in Slovakia, e.g. Megaubytovanie.sk,
offering domestic accommodations, but they were not scraped due to technical limitations and a
lack of expertise in this field.
Data scraped from Hotels.com covered 608 unique accommodations, whereas data from Booking.com
covered 5 253 accommodations. Considering the size of the Slovak survey frame, which consists of
5 930 unique records (as of 31.12.2019), it was decided to use the Booking.com data within this case
study.
First of all, GPS coordinates (longitude and latitude) were imputed into the survey frame file, using the
HERE Maps API tool prepared by the Polish team. Compared with other similar tools, this
one performed better and reached a higher precision of localized addresses. The majority of addresses were
located precisely, some were not found exactly and were thus replaced by a street or even city
centre, and some were missed entirely. Table 58 illustrates the results of this procedure.
Table 58. Results of imputing GPS coordinates to the survey frame
Number of records 5 930
Uniquely localized 4 440
Not localized at all 104
Repeated coordinates (city centres, street centres) 1 386
The frame used does not, in theory, cover the whole base of scraped data. There are two
main reasons for this: first, the frame was last updated on 31 December 2019, so new
accommodation establishments may already be active on reservation portals; second, many
natural persons offer accommodation establishments on web portals but are not part of the
survey frame. Nevertheless, once units not covered by the frame are identified, they can be
filtered and only those relevant for the survey processed further.
There was a major difference between the data from Hotels.com and Booking.com. Whereas the Booking.com
data were more exhaustive in terms of volume, the Hotels.com data were enriched with the addresses of the
accommodations. These can bring additional important information into the linking process, as many of
the GPS coordinates in the frame are not unambiguously and precisely assigned. On the other hand,
due to the limited coverage of the Hotels.com data, it was decided to use the Booking.com data for
this use case. There are three linking criteria: municipality, name of the accommodation and
distance.
There is one crucial issue when it comes to the distances between establishments: they do not
match perfectly. Coordinates imputed with the HERE Maps API can in some cases differ significantly
from those used in the Hotels.com or Booking.com data. On the reservation portals they usually point
at the access road or at the front of a building, whereas coordinates in the survey frame are tied to the
building itself, which can produce varying distances. These can be calculated, e.g., with the Haversine
formula. The following picture illustrates the difference between the frame and Hotels.com coordinates.
Figure 49. Coordinates in the frame (left) and on the Hotels.com (right)
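The Haversine formula mentioned above gives the great-circle distance between two latitude/longitude points. A minimal sketch, with invented coordinates:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2, r=6371000.0):
    """Great-circle distance in metres between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * r * asin(sqrt(a))

# Two nearby points (invented), roughly 100 m apart in longitude
d = haversine_m(48.1486, 17.1077, 48.1486, 17.1090)
```

Distances of this order of magnitude are exactly what the threshold analysis below works with.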
Taking into consideration that a large part of the survey coordinates point at a city or street
centre, using distance alone as a linking criterion can cause many false matches (False Positives). This
problem can be mitigated by using addresses, linking an accommodation by name similarity within a
similar street name and city. Similarity can be calculated e.g. with the Jaro-Winkler formula,
which measures distances between strings and usually yields satisfying results. However, since no
addresses are available in the Booking.com data, distances and establishment names within
a specific municipality were used.
The linking procedure is as follows:
1. Identification of the municipality names and their standardization in both the Booking.com and survey
frame files, in order to have a stable blocking structure. Two accommodations can be linked only
if they are within the same municipality.
2. Cleaning and standardization of the accommodation names: each put into lower case, any
punctuation omitted, key words like apartment, hotel, pension, chateau, etc. deleted. Otherwise,
the Jaro-Winkler distance can score very high values merely due to similarity in the types of
establishments in their names.
3. Calculation of the Haversine distance between the two points.
4. Linking either none or the most suitable accommodation establishment, according to the selected
criteria.
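The name-cleaning step can be sketched as follows; the stop-word list and example names are illustrative, not the project's actual list.

```python
import re
import unicodedata

# Illustrative establishment-type key words; the team removed words like
# apartment, hotel, pension, chateau
STOPWORDS = {"hotel", "apartment", "apartman", "pension", "penzion",
             "chateau", "hostel"}

def standardize_name(name: str) -> str:
    """Lower-case, strip accents and punctuation, drop establishment-type key words."""
    name = unicodedata.normalize("NFKD", name.lower())
    name = "".join(c for c in name if not unicodedata.combining(c))  # strip accents
    name = re.sub(r"[^\w\s]", " ", name)                             # drop punctuation
    tokens = [t for t in name.split() if t not in STOPWORDS]
    return " ".join(tokens)
```

For example, "Hotel Kyjev**" and "Penzión U Jakuba" reduce to "kyjev" and "u jakuba", so the subsequent string comparison sees only the distinctive part of each name.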
Some of the municipalities were present in only one of the files, meaning that their accommodations are also
unique and can be directly assigned as True Negatives (TN). There are 201 units in the scraped file within
cities not covered by the frame and 933 units vice versa, which are non-negligible amounts.
The distances (in metres) within municipalities are displayed in Figure 50.
Figure 50. Haversine distances between scraped and registered accommodation establishments within municipalities (in meters)
When it comes to the minimum distances between two accommodations within a city, they are
certainly low. However, in areas with a high density of tourism, or when the GPS coordinates in
the survey frame were not precisely assigned, or when there is truly nothing within a city to link, choosing
the minimum distance as the sole criterion is insufficient and incorrect. Because 24% of the coordinates
in the frame are misleading, another criterion was added to the process: name similarity.
Two accommodations within a municipality are linked only if the Jaro-Winkler distance of their names
is below 0.2; in such a case the closest candidate is selected. To delimit a circle in which to search for
possible links, a distance threshold is set.
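The linking rule above (link only if the Jaro-Winkler distance of the cleaned names is below 0.2, then take the closest candidate) can be illustrated with a minimal pure-Python implementation of the Jaro-Winkler distance; in practice an existing implementation (e.g. in an R string-distance package) would be used, and the example strings are invented.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings (1.0 = identical)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):                        # count matching characters
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0                                       # count transpositions
    for i, flag in enumerate(m1):
        if flag:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler_distance(s1: str, s2: str, p: float = 0.1) -> float:
    """1 - Jaro-Winkler similarity; values below 0.2 would trigger a link here."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):                  # common prefix, max 4 characters
        if a != b:
            break
        prefix += 1
    return 1.0 - (j + prefix * p * (1.0 - j))
```

For the textbook pair "martha"/"marhta" the distance is about 0.039, well below the 0.2 threshold, so such a pair would be linked.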
Figure 51. Minimum Haversine distance between scraped and registered accommodation establishments within municipalities.
Several threshold settings were tested: 30, 50, 70, 100, 200 and 500 metres. Each time, a random
sample was selected from the linked database and used for a clerical review in order to assess
the links against their real status, considering a confusion matrix from the point of view of the
scraped data.
The following confusion matrices relate to the different threshold settings. Since the linked files were
reviewed only on random samples, the values in the matrices are estimates of the real
values.
Table 59. Confusion matrix for the threshold of 30 meters
Threshold = 30 m Actual
Matched Not matched
Predicted Positive 45 0
Negative 45 163
Table 60. Confusion matrix for the threshold of 50 meters
Threshold = 50 m Actual
Matched Not matched
Predicted Positive 49 0
Negative 46 158
Table 61. Confusion matrix for the threshold of 70 meters
Threshold = 70 m Actual
Matched Not matched
Predicted Positive 51 0
Negative 42 160
Table 62. Confusion matrix for the threshold of 100 meters
Threshold = 100 m Actual
Matched Not matched
Predicted Positive 54 0
Negative 50 149
Table 63. Confusion matrix for the threshold of 200 meters
Threshold = 200 m Actual
Matched Not matched
Predicted Positive 59 0
Negative 33 161
Table 64. Confusion matrix for the threshold of 500 meters
Threshold = 500 m Actual
Matched Not matched
Predicted Positive 67 2
Negative 46 138
The level of False Positives is very low. This is due to the algorithm, which uses a similarity criterion on the
names of the accommodation establishments to make a link. Therefore, the vast majority of the links
that are made are True Positives. On the one hand, this increases the quality of the links made; on the other,
it also increases the level of negatives, not only True Negatives but also False Negatives (missed links).
For the purposes of a survey frame update, it is mainly the negatives that are of interest (in contrast to the usual
focus on the positives). What needs to be maximized is the quality of the True Negatives, by decreasing the
number of False Negatives and False Positives, as both cause a loss of possible candidates for a
frame update. It should be noted that units cannot be deleted from the frame on the basis of scraped data,
as establishments have no obligation to advertise themselves on the Internet, and also due to the
considerable inaccuracy of the linking procedure. Units that are found in the data but are not part of
the frame can be examined further and only then added to the database. Linking can suggest
which units to check for an update but cannot be used as an automatic process, as the survey frame
contains much information that cannot be scraped from the Internet. Strict linking rules yield many
negatives, since only perfect matches are taken into account; in that case the set of negatives would
contain a significant share of False Negatives, which would require manual review. On the other hand,
relaxed rules lead to a smaller group of negatives and thus less manual examination, but tend to
increase False Positives, which absorb some truly unmatchable units that should be subject to the
frame update.
Based on the confusion matrices, the following characteristics can be calculated.
Table 65. Calculations of precision, sensitivity, specificity and accuracy
Threshold Precision Sensitivity Specificity Accuracy
30 m 1 0.5 1 0.82
50 m 1 0.52 1 0.82
70 m 1 0.55 1 0.83
100 m 1 0.52 1 0.8
200 m 1 0.64 1 0.87
500 m 0.97 0.6 0.99 0.81
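The characteristics in Table 65 follow directly from the confusion-matrix counts; as a sketch, using the counts from Table 59 (30 m threshold):

```python
def confusion_metrics(tp, fp, fn, tn):
    """Precision, sensitivity, specificity and accuracy from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
    }

# Counts from Table 59 (threshold = 30 m)
m30 = confusion_metrics(tp=45, fp=0, fn=45, tn=163)
# precision 1.0, sensitivity 0.5, specificity 1.0, accuracy ~0.82
```

These values reproduce the 30 m row of Table 65.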
In order not to miss any possible candidate for the frame update, it was decided to also include the
name comparison, which resulted in extremely high Precision for every threshold. Only justifiable units
are linked and the others are left for manual checking; there is no need to link accommodations with different
names (a Jaro-Winkler threshold of 0.2 is rather loose). Based on these findings, a circle with a 70 m radius was used
to search for establishments. Survey frame units with wrongly assigned coordinates would not be
reachable even with a higher threshold; on the contrary, as many False Positives as possible were
avoided.
A similar study was prepared, but without considering the accommodation
establishment names. A threshold of 70 metres was used, and the nearest establishment was
linked (if any within the circle). The confusion matrix is as follows.
Table 66. Confusion matrix for the threshold of 70 meters
Threshold = 70 m (without names) Actual
Matched Not matched
Predicted Positive 68 56
Negative 31 97
It is obvious that the values changed considerably. On the one hand, the number of True Positives rose, as there
were no further obstacles. On the other, the number of False Positives increased dramatically, which
could cause a loss of possible newly established accommodations. The group of negatives, set aside for
further examination, is now smaller. Given these findings, it was preferred to incorporate the
names into the process as well, perhaps with a higher Jaro-Winkler distance threshold.
Issues/Exceptions:
The use of the Hotels.com data with addresses is also a subject of further analysis, where street names
and house numbers can be used instead of GPS coordinates. A higher sensitivity is expected.
Output summary:
Booking.com data were used for the study on the survey frame update, as they outperform
Hotels.com in terms of volume and GPS precision. It was shown that they can be used to point out
possible units that can then be manually checked for addition to the frame.
However, the relatively high level of inaccuracy of the linking procedure prevents the automatic use of the
whole process at the moment.
4.5. Spatial-temporal disaggregation of data
Use Case Identification
Use Case ID WPJ.5.NL
Use Case Name Spatial-temporal disaggregation of data
Date of creation 20.08.2020. (Last update 02.10.2020)
Design scope (Sub-)system from which most specifications are a “black-box”18.
Use Case Definition
Description Using web scraped data from Hotels.com to disaggregate monthly survey data
on number of tourists to smaller timescales.
Preconditions Web scraped data from Hotels.com is available for several months. Survey
data for several months is available.
Flow:
Data on the number of tourists staying at hotels are available through the tourism survey. This survey,
however, is conducted on a monthly basis and does not tell how the tourists were distributed within
the month. One might expect relatively large numbers of tourists during weekends or on specific holidays.
Statistics Poland provided an R script to disaggregate the monthly survey data to daily data, using the
data web scraped on a daily basis from Hotels.com. It was adapted for use at Statistics Netherlands;
adapting mostly consisted of pointing the script to the right data.
The provided script was first run using input data from Poland, then with our own data on Amsterdam.
The script outputs a data file, which was visualised in graphs; see the example
below.
18 Three hierarchical categories of design scope from Cockburn were used: (sub)system black-box, (sub)system white-box and component. [Cockburn, A. (2001). Writing Effective Use Cases. Addison-Wesley. ISBN 0-201-70225-8. OCLC 44046973.]
Figure 52. The script output data for nights spent in the tourist accommodation establishments in Amsterdam
The clearly visible peaks occur on Fridays. These could be people booking a weekend in
Amsterdam.
The disaggregated data show a very low point on 17 November. This corresponds to a very high
number of offers in the web scraped data for that day. Why there were so many offers that
day is unknown.
The output of the script should give daily values for the number of guests and overnight stays, both for
domestic and foreign tourists. In the output data, some variables have the same value for every
day; see Table 67. This also occurred when the input data from Poland were used, but in that case
a different set of variables was constant. The auxiliary file on R2 shows a zero for these
variables.
Table 67. The script output data with the same value for certain variables
[Chart for Figure 52: y-axis 0 to 90, x-axis 2019-08-01 to 2020-02-01, series "Amsterdam - nights"]
New script
Statistics Poland provided a new version of the R script in July 2020. This script was again tested
with our own input data.
The new version of the script can use several different methods for the disaggregation (LASSO, log-
linear, ridge and shrinkage). It also offers a choice of using the mean price, the number of offers, the
available supply, or a combination of these, from the web scraped data.
All methods were tested. Statistics Netherlands wrote a small piece of code to loop over the different
methods, using the same input data, and to create graphs for each one.
To visualise the weekend effect, the Fridays were highlighted in red in the graphs.
The problem of some variables not varying on a daily basis is fixed in this new script.
The LASSO method creates output similar to that of the old script when used with the
number of offers and mean price as input. When the supply is also included, something strange
happens: in the first half of the time series the data seem inverted; instead of peaks on Fridays, very
low values are seen on Fridays. From roughly November onward, the situation is 'normal' again.
Figure 53. Output of the Lasso method for number of offers and mean price
Figure 54. Output of the Lasso method for number of offers, mean price and supply
The log-linear method in our case gives only very small daily fluctuations around the monthly survey
values. In Figure 55, the jumps at the start of each new month are clearly visible.
The shrinkage method (figures not included here) gave results similar to the LASSO method.
The ridge method, using the offers and mean price options, resulted in negative peaks on Fridays.
When all three options were used (offers, supply and mean price), some of the Friday peaks were positive
and some negative. Using only supply and mean price, the daily fluctuations became very small,
similar to the results of the log-linear method.
Figure 56. The ridge method for number of offers and mean price
Figure 57. The ridge method for number of offers, mean price and supply
Figure 58. The ridge method for mean price and supply
Issues/Exceptions:
The disaggregation shows the busiest day for starting a holiday; it does not, however, show how long
the tourists stay. One would expect many tourists to stay from Friday night to Sunday morning,
but the disaggregated data will only show the Friday, thereby underestimating the number of tourists on
Saturday.
The different methods all show a weekly pattern. The amplitude of this pattern differs depending
on the method and input options used; some options result in larger differences between the peak day
and the rest of the week than others.
No direct comparison for the daily data is available. Unlike the flash estimates, which can be
compared to the survey data when they arrive a few months later, there is no direct measurement
of the daily number of tourists.
One could talk to hotel owners and ask whether the Friday peak is something they recognize. One could
also use other secondary data to try to confirm the disaggregated patterns, for instance daily
electricity use or the number of flights arriving per day. However, these data have their own
accuracy issues.
Statistics Netherlands may in the near future obtain daily data from hotels in Amsterdam
(in collaboration with the municipality of Amsterdam), through automated systems that link directly
to booking systems in individual hotels. This data could be used to verify our method.
Output summary:
It is possible to disaggregate monthly tourism data from surveys into daily data using web scraped
data. A clear weekly pattern was found.
The magnitude of the day-to-day differences, however, cannot be accurately determined from web scraped
data alone and should be verified with data obtained from other sources.
Use Case Identification
Use Case ID WPJ.5.PL
Use Case Name Spatial-temporal disaggregation of data
Date of creation 01.09.2020
Use Case Definition
Description: Temporal disaggregation of the monthly data on the occupancy of
accommodation establishments into daily data.
Preconditions: Basic knowledge of programming in R language
Flow:
1. Preparation of monthly time series from the survey on the occupancy of accommodation
establishments and daily time series of statistics from the accommodation portal.
2. Construction of a regression model for the monthly data using the classical (ordinary) least squares
method or the generalised least squares method.
3. Benchmarking of preliminary daily results to known monthly values.
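Steps 2 and 3 can be sketched with a toy example: a one-regressor least-squares model at the monthly level, a preliminary daily series, and multiplicative benchmarking so that the daily values add up exactly to the known monthly totals. All numbers are invented.

```python
# Toy data: two "months" of three days each, a daily indicator x from
# web scraping and known monthly survey totals y to be disaggregated.
days_per_month = [3, 3]
x_daily = [10, 12, 20, 11, 13, 22]     # e.g. daily number of offers
y_monthly = [420, 470]                 # monthly survey totals

# Step 2: one-regressor least squares at the monthly level
x_monthly, i = [], 0
for n in days_per_month:
    x_monthly.append(sum(x_daily[i:i + n]))
    i += n
xbar = sum(x_monthly) / len(x_monthly)
ybar = sum(y_monthly) / len(y_monthly)
b = sum((x - xbar) * (y - ybar) for x, y in zip(x_monthly, y_monthly)) \
    / sum((x - xbar) ** 2 for x in x_monthly)
a = ybar - b * xbar

# Preliminary daily series: spread the monthly intercept evenly over the days
p_daily, i = [], 0
for n in days_per_month:
    p_daily.extend(a / n + b * x for x in x_daily[i:i + n])
    i += n

# Step 3: multiplicative benchmarking, so each month's daily values
# sum exactly to the known monthly survey total
bench, i = [], 0
for n, y in zip(days_per_month, y_monthly):
    s = sum(p_daily[i:i + n])
    bench.extend(v * y / s for v in p_daily[i:i + n])
    i += n
```

After benchmarking, the daily values of each month add up to the corresponding survey total by construction.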
Issues/Exceptions:
The time series of the daily statistics from the accommodation portal must be complete. Gaps in daily
data must be imputed.
Output summary:
Two data sources are used in temporal disaggregation: low-frequency data, which must be
disaggregated, and high-frequency data containing auxiliary variables. In the case of disaggregation of
data on the use of the accommodation establishments, these data include:
data from the web scraping of accommodation portals,
data from statistical surveys on the occupancy of accommodation establishments.
This use case was prepared on the basis of data concerning two regions at NUTS 2 level and six
variables: tourists and overnight stays, in total and divided into Poles and foreigners. The first region
(Mazowieckie Voivodship) can be considered a tourist region, as it has a large number of
accommodation establishments and many overnight stays per 1 thousand inhabitants each
year. The second region (Podkarpackie Voivodship) can be considered a non-tourist region.
Within the project, three methods have been tested:
LASSO,
log-line model,
ridge regression
with auxiliary variables including the number of offers on the portal, the average price of offers and
the total value of all offers. The models were built either with all variables or with the average price of offers only.
Precision of the methods being compared
The models were estimated in RStudio using the lars and glmnet packages. The R-squared was
calculated for monthly and daily data for all methods and sets of variables. A missing R-squared value in the
tables below means that the model did not generate acceptable forecasts, other than the average
value of the dependent variable.
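For a single auxiliary variable, the ridge estimator reduces to shrinking the least-squares slope by a penalty added to the denominator. A toy sketch with invented numbers (in the project itself the models were fitted with the glmnet package in R):

```python
# Invented monthly auxiliary variable x and survey totals y
x = [42, 46, 40, 50]
y = [420, 470, 400, 510]
lam = 10.0                  # ridge penalty on the (centred) slope

xbar = sum(x) / len(x)
ybar = sum(y) / len(y)
sxx = sum((v - xbar) ** 2 for v in x)
sxy = sum((u - xbar) * (w - ybar) for u, w in zip(x, y))

b_ols = sxy / sxx           # ordinary least squares slope
b_ridge = sxy / (sxx + lam) # ridge slope, shrunk toward zero
a_ridge = ybar - b_ridge * xbar
```

The larger the penalty `lam`, the more the slope is shrunk toward zero, which is what stabilises the daily forecasts discussed below.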
Table 68. R-square of models for the tourist region
Method Variables Tourists Nights spent
Monthly data Daily data Monthly data Daily data
LASSO 3 0.9946 0.8910 - 0.2448
1 - 0.1820 - 0.1593
Log-line model 3 0.9992 0.3558 0.9993 0.2989
1 0.9992 0.2102 0.9992 0.1788
Ridge regression 3 0.9952 0.8768 0.9927 0.8474
1 0.9936 0.8431 0.9919 0.8048
Table 69. R-square of models for the non-tourist region
Method Variables Tourists Nights spent
Monthly data Daily data Monthly data Daily data
LASSO 3 0.9831 0.8822 0.9138 0.2939
1 - 0.0427 - 0.0422
Log-line model 3 0.9996 0.2639 0.9996 0.2299
1 0.9994 0.0792 0.9994 0.0766
Ridge regression 3 0.9890 0.8998 0.9751 0.7473
1 0.9614 0.4665 0.9430 0.4532
In all cases, the results for the disaggregation of the number of tourists are better (a higher R-squared
is achieved) than for the nights spent. The R-squared is similar across
methods and regions for the monthly data. For the daily data, the differences in R-squared between the
regions are more visible, in favour of the tourist region. The log-line model achieved a high R-squared
for monthly data, but a surprisingly low R-squared for daily data. In several cases LASSO could not
generate acceptable forecasts other than the average value of the dependent variable. Among other
things, this may be because the implementation of LASSO in the lars package only generates
the steps at which a new variable enters the model. The most reliable method proved to be
ridge regression; its implementation in R allows forecasting at any value of the regularization parameter.
Details of the implementation of both methods can be found in Annex 2 – General approach for data
disaggregation.
Calendar effects and seasonality
This section presents results on the occurrence of seasonality in the disaggregated data as well as on
calendar effects. Since there are no true values for the high-frequency data, one must rely mainly on
intuition when interpreting the results. For the occupancy of tourist accommodation establishments,
one can expect:
greater use of accommodation establishments at the weekend than in the middle of the
week, and during certain holidays and events such as New Year's Eve,
dependence of the occupancy of accommodation establishments on the season in coastal
and mountain areas,
lower usage on holidays such as Christmas and Easter.
Figure 59 shows the weekly seasonality of the disaggregated data from 28 January to 31 March 2019.
Figure 59. Seasonality of disaggregated data
Based on the analysis of the figure, the changes in the time series do not
appear to be random. To check whether there is weekly seasonality, statistical tests
implemented in the R seastests package were used. Table 70 summarizes the results of
these tests for the auxiliary variables and for the data disaggregated using ridge regression.
Table 70. P-value of weekly seasonality tests
High-frequency variable Test
Welch Kruskal-Wallis F-test
Number of offers 0.0009 0.01915 0.00014
Average offer price 0.2339 0.03747 0.10729
Total value of offers 0 0 0
Tourists (3 variables) 0.01385 0.00523 0.01291
Nights spent (3 variables) 0.02091 0.00782 0.00594
Tourists (1 variable) 0.31451 0.02308 0.11965
Nights spent (1 variable) 0.36063 0.02442 0.14463
The number of offers, the total value of the offers, and the disaggregated data obtained from the
model with the three explanatory variables (tourists, nights spent) show significant weekly seasonality.
The occurrence of seasonality for the average price and for the results of the one-variable model is
ambiguous (the Kruskal-Wallis test indicates seasonality, while the Welch and F
tests do not). Nevertheless, the seasonal pattern only partially
meets expectations: there is no sign of a higher occupancy rate throughout the weekend, but only on
Fridays.
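A crude complement to the formal tests is to compare day-of-week means directly. The sketch below uses an invented three-week daily series with a Friday peak; the real analysis used the tests in the seastests package.

```python
from statistics import mean

# Invented daily series covering three full weeks, Monday first
daily = [80, 82, 81, 85, 130, 95, 78,
         79, 83, 80, 88, 135, 97, 77,
         81, 80, 82, 86, 128, 96, 79]

weekday_groups = [daily[d::7] for d in range(7)]      # one group per day of week
weekday_means = [mean(g) for g in weekday_groups]
friday_mean = weekday_means[4]                        # index 4 = Friday
rest_mean = mean(v for d, g in enumerate(weekday_groups) if d != 4 for v in g)
```

A Friday mean far above the mean of the remaining days is the informal counterpart of the weekly seasonality the tests detect.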
Figure 60 shows the disaggregated results for December 2019 and January 2020, the period in which
Christmas and New Year fall (marked with red circles). Their impact is evident in the
disaggregated data: lower occupancy of establishments at Christmas and higher on New Year's Eve.
Figure 60. Calendar effects in disaggregated data
In the presented case study, ridge regression turned out to be the most reliable method in terms of
the precision and acceptability of the disaggregated data, and the disaggregated results showed weekly
seasonality and calendar effects.
Use Case Identification
Use Case ID WPJ.5.SK
Use Case Name Spatial-temporal disaggregation of data
Date of creation 27.09.2020
Use Case Definition
Description: Temporal disaggregation of main indicators from the monthly survey on
tourist accommodation establishments using web-scraped data.
Preconditions: A time series of daily web-scraped data, as long as possible, and monthly
data from the survey on tourist accommodation establishments.
Flow:
Daily data scraped from accommodation portals constitute a promising basis for estimating daily
tourism indicators similar to those published from the monthly/annual survey on tourist
accommodation establishments. Generally, the numbers of domestic and foreign tourists and of their
nights spent are of most interest.
Not many of the variables scraped from the portals can be used as independent variables
in regression models. From the whole set, only the daily mean price and the number of offers have some
explanatory power; the variable supply can then be added to the set by calculating the product of the
two. As shown in the use case on flash estimates, strong correlations between survey data
(volumes of tourists and nights spent) and scraped data are present in the Slovak case. This provides
reasonable preconditions for the temporal disaggregation of the monthly time series (stemming
from the survey) into daily time series using web-scraped data.
Daily time series of accommodation data scraped from Hotels.com are available from April 2019 to
August 2020, and from Booking.com from July 2019 to April 2020. Missing dates were
imputed using Kalman smoothing. For each day of scraping, the number of offers, the mean price and the
supply (offers * mean price) were calculated; these are used as explanatory variables in the temporal
disaggregation. Secondly, the monthly data from the survey, namely the numbers of
domestic and foreign tourists and of their nights spent, were put together for the period May 2019 –
June 2020. In order to estimate daily time series from these data, the following steps are needed:
specify the aggregation matrix A, which turns the daily scraped data into monthly data,
find a feasible monthly-level regression model with independent variables from among the
scraped variables and the survey indicators as dependent variables,
calculate the preliminary daily series p using the regression model,
benchmark p so that it matches the known monthly indicators from the survey.
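The first step, the aggregation matrix A, is simply a months-by-days 0/1 matrix whose rows sum the days of each month. A small sketch with illustrative month lengths:

```python
# Illustrative month lengths; the real series runs May 2019 - June 2020
days_per_month = [31, 30]            # e.g. May and June

# Build A row by row: row m has 1s over the days belonging to month m
rows, start = [], 0
total_days = sum(days_per_month)
for n in days_per_month:
    rows.append([1 if start <= d < start + n else 0 for d in range(total_days)])
    start += n

# Applying A to a daily series yields the monthly sums
daily = [1.0] * total_days           # dummy daily series of ones
monthly = [sum(a * x for a, x in zip(row, daily)) for row in rows]
```

With a series of ones, each monthly sum is just the number of days in that month, which makes the role of A easy to verify.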
The PL team prepared R code for this purpose, working with prescribed inputs (portal and survey data)
and offering several methods, such as LASSO, ridge and log-linear regression as well as their own shrinkage
method. Both additive and multiplicative benchmarking are available. Although the code worked
very well and was fast and convenient, it was further enhanced, mainly in terms of efficiency and
standardization; a feature was also added for calculating daily estimates beyond the survey period.
Naturally, these forecasts are not benchmarked, as no survey data are available for the current
month. Therefore, only the LASSO and shrinkage methods can be used for forecasting, as they produce
quite stable estimates which do not require drastic benchmarking. The log-lin and ridge
methods, on the other hand, suffer from poor estimation of the preliminary series, for which benchmarking is a crucial
step. Another option would be to use flash estimates for benchmarking the forecasted values.
Moreover, as the coronavirus pandemic affected the whole tourism industry, there is a huge fall in
our time series, mainly in April 2020, when government measures in Slovakia prevented
accommodation owners from doing business. The majority of them, however, did not stop
advertising on the portals. This inconsistency can easily confuse our models, changing the
negative relationship between the number of offers and the number of tourists into a slightly positive
one for some period. Therefore, in order to produce robust and reliable estimates, the R code was
adjusted to leave this period out of the modelling.
Data from Hotels.com
1. ridge
The ridge regression method brings the most variability into the time series among the methods.
Comparing the statistics of both the low-frequency (monthly) and high-frequency (daily) regression
models, the combination of all three independent variables (offers, supply and mean price) explains
the variability of the dependent variables best (highest R-squared). On the other hand, the variable
offers is uncorrelated with the final benchmarked daily series. The coefficient of variation of the
projected series of the number of tourists is 61%. A seasonal pattern in the daily series is apparent;
however, negative values can occur, especially in the case of low monthly values (coronavirus
pandemic measures imposed). The following figures illustrate the daily numbers of tourists as
estimated with this method. Grey lines mark the ends of weeks, to help detect
possible seasonal patterns. The general trend is obvious, as it stems from the monthly survey data.
The New Year's Eve peak, the local Saturday peaks, as well as the slow recovery from the pandemic
measures, are all plausible. On the other hand, the series seem quite volatile. Figures
for specific months (September and December 2019) allow a deeper look into the daily data. Although
these values cannot be compared to any official statistics, they show at least some potential for
further examination and modelling.
Figure 62. The ridge regression method of tourists accommodated in December 2019
Figure 63. The ridge regression method of tourists accommodated in September 2019
2. LASSO
The LASSO regression method performed best with the combination of offers and supply as
independent variables (highest R-squared and correlations) and, in contrast to ridge regression,
brings less variability and volatility into the daily series while still preserving some seasonality.
The variable number of offers is now more correlated with the number of tourists. The coefficient of
variation of the projected series of the number of tourists is 56%. Negative values can still occur.
The following figures, similar to those of the previous method, illustrate the daily numbers of
tourists. The general trend is clear, but it can be observed that the model probably captured the
relation between portal and survey data in reverse: many tourists are accommodated on weekdays,
with only a minority during weekends, and on top of that there is a bottom value on New Year's Eve.
On the right side of the plot, in blue, are forecasted values for the period not covered by survey
data, which seem quite reasonable even though these values are not benchmarked. Despite its lower
volatility, this method is subject to further adjustments and is not applicable at the moment due
to the reversed weekly pattern.
Figure 64. The LASSO regression method of tourists accommodated in September 2019
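Both penalised regressions above shrink the coefficients of the daily-frequency model; ridge has a closed form, whereas LASSO requires an iterative solver (e.g. coordinate descent) and is omitted here. A minimal Python sketch of the ridge fit with illustrative standardised regressors (the project's actual scripts are in R):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge estimate: (X'X + lam * I)^{-1} X'y.
    # For simplicity all coefficients are penalised; in practice the
    # intercept is handled by centring X and y first.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # e.g. offers, supply, mean price (standardised)
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

beta = ridge_fit(X, y, lam=1.0)
print(beta.round(2))           # slightly shrunk towards zero
```

Increasing `lam` shrinks the estimates further, which is what stabilises the preliminary daily series at the cost of some bias.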
3. shrinkage
The shrinkage method was developed by the PL team. Its main idea is to adjust negative values in the
preliminary series. However, as benchmarking to the monthly values can introduce negatives and it is
the last step of the process, negative values occur with this method as well. Offers and supply were
chosen as explanatory variables, as they yield the best statistics. The coefficient of variation is
equal to 56%. Shrinkage brings less variability and volatility, and thus less seasonality, into the
series than LASSO and ridge. Weekly patterns resemble inverted parabolas with peaks in the middle of
the weeks, which is not particularly plausible.
Figure 67. The shrinkage method of tourists accommodated in December 2019
Figure 68. The shrinkage method of tourists accommodated in September 2019
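One plausible reading of the two-step logic described above (a simplified stand-in, not the PL team's exact algorithm): negative preliminary daily values are first removed, and the daily series is then benchmarked pro rata so that it sums to the monthly survey total. Names and data are illustrative.

```python
import numpy as np

def shrink_and_benchmark(daily_prelim, monthly_total):
    # Step 1: remove negatives from the preliminary daily series
    # (a crude stand-in for the shrinkage adjustment).
    adjusted = np.clip(daily_prelim, 0.0, None)
    # Step 2: pro-rata benchmarking so the daily values sum to the
    # monthly survey total.
    return adjusted * monthly_total / adjusted.sum()

prelim = np.array([10.0, -2.0, 5.0, 7.0])   # toy 4-"day" month
daily = shrink_and_benchmark(prelim, monthly_total=44.0)
print(daily)                                # non-negative, sums to 44
```

Note that if benchmarking were applied with a method that can overshoot (e.g. additive adjustment), negatives could be reintroduced, which is exactly the behaviour reported above.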
4. log-lin
This method fails to capture any seasonal pattern; the resulting time series look more like mean
values over the month. The log transformation destroys almost every signal in this case, so the
preliminary series are very smooth and lack diversity. The method is therefore not suitable for
this task.
Data from Booking.com
Data scraped from Booking.com cover a shorter period than data from Hotels.com. Moreover, as they
comprise many more establishments and advertisers, daily mean prices tend to be smooth with low
variance, which is a weaker basis for modelling. What does vary is the number of offers (many
accommodations with only one room), but the analysis indicated that this does not carry enough
information to make reliable daily predictions.
Among the above-mentioned methods, ridge, LASSO and log-lin produced poor results with such
a short period of explanatory variables; the only acceptable estimates are those obtained with the
shrinkage method. The number of offers and the mean price were chosen as independent variables. The
coefficient of variation of the resulting time series is only 40%, and in general the information
does not seem to be correctly captured: there are peaks on weekdays rather than on Fridays and
Saturdays. A longer time series would be necessary.
Figure 70. The log-lin method of tourists accommodated in December 2019
Figure 72. The log-lin method of tourists accommodated in September 2019
Issues/Exceptions:
Some other regression models and approaches could be tested to find the most feasible solution. In
Slovakia, the Hotels.com page is not particularly popular but was still able to provide a sound basis
for temporal disaggregation. Google Trends data could be useful for disaggregating monthly time series
into weekly ones. In order to calculate the precision and accuracy of the daily estimates, a pilot
survey on daily tourism among accommodation establishments would be necessary.
Output summary:
Among the tested methods, Hotels.com data with ridge regression using the number of offers, the mean
price and their product as explanatory variables turned out to be the most feasible solution
for temporal disaggregation of monthly series into daily series, especially in terms of seasonality,
volatility and the expected behaviour of tourists throughout the year. Booking.com data suffer from
an insufficient period length and from the sheer number of advertising accommodation establishments,
which smooths the average room price. Longer time series of scraped data would be beneficial.
4.6. Flash estimates of the occupancy of accommodation establishments
Use Case Identification
Use Case ID WPJ.6.NL
Use Case Name Flash estimates
Date of creation 20.08.2020. (Last update 02.10.2020)
Design scope (Sub-)system from which most specifications are a “black-box” 19.
Use Case Definition
Description Using web scraped data from Hotels.com to create flash estimates on the
number of tourists.
Preconditions Web scraped data from Hotels.com is available for recent months. Survey data
from earlier months is available.
Flow:
Statistics Poland provided an R script for creating flash estimates. This script was adapted for use in
Statistics Netherlands. The script combines survey data on the number of tourists in hotels from earlier
months with web scraped data from Hotels.com from recent months, in order to ‘predict’ what the
survey is likely to find in the recent months. Such a model could be used for creating flash estimates
on the number of tourists.
Adapting the script took little time, and was mostly about pointing the script to the location of our
data. Setting up the file with survey data for the script to use took a little bit more time.
The web scraped data was already available from use case WPJ.1.NL.
The data from the tourism survey of Statistics Netherlands had to be put in a specific format for the
model to use. This required some manual work (selecting and copying bits of data from the SN website,
cutting and pasting columns).
In the future, this could be automated, using the 'open data' web service of SN. This could be used to
create a fully automated system that web scrapes and creates flash estimates on its own.
The script can use two different methods to model the number of tourists: a linear model and an
ARIMAX model. The ARIMAX model was expected to be better at adapting to seasonality effects,
which is important because the number of tourists staying in hotels in the Netherlands is strongly
seasonal. At the time of writing, however, web scraped data for a full 12 months were not yet
available. This means the ARIMAX model could not be used yet; only the linear model was tested.
The script models the number of guests in hotels, as well as the number of nights spent. It also
discriminates between national guests and foreign guests. The script can use the mean price of
available hotel rooms as an input to the model, and also the number of offers available on the website.
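The linear-model variant can be sketched with plain numpy: survey months are used to fit nights spent against the scraped regressors, and the fitted model then "predicts" the survey value for a recent month where only web-scraped data exist. All values and names are illustrative, and this is not the NL script itself; the ARIMAX variant would additionally require a time-series library such as statsmodels.

```python
import numpy as np

# Illustrative monthly data: survey nights known for months 0-5,
# web-scraped regressors known for all 7 months.
mean_price = np.array([90.0, 95.0, 110.0, 120.0, 115.0, 100.0, 105.0])
offers     = np.array([400.0, 420.0, 380.0, 350.0, 360.0, 410.0, 395.0])
nights     = np.array([50e3, 52e3, 61e3, 68e3, 64e3, 55e3])  # survey, months 0-5

# Fit nights ~ intercept + mean price + offers on the survey months.
X_train = np.column_stack([np.ones(6), mean_price[:6], offers[:6]])
beta, *_ = np.linalg.lstsq(X_train, nights, rcond=None)

# Flash estimate for month 6, where only the scraped regressors exist.
flash = np.array([1.0, mean_price[6], offers[6]]) @ beta
print(round(flash))
```

As the plausibility analysis below shows, such a model extrapolates poorly when the new regressor values fall far outside the training range.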
19 Three hierarchical categories from Cockburn Design scope (sub)system black-box, (sub)system white-box and component were used. [Cockburn (2001), “Writing effective use cases. Addison-Wesley. ISBN 0-201-70225-8. OCLC 44046973”].
The script was successfully run using our own input data. The model was run for several large cities
and NUTS 2 regions.
The script outputs a file with data. The output data was visualised in line graphs for analysis. Below are
two examples of these, one with the data for Amsterdam, and one with data for Den Haag (The Hague).
A complication that arose during the project was due to the effects of the COVID-19 pandemic.
Numbers of tourists were much smaller than normal, starting in March 2020. This makes it harder to
determine the plausibility of the flash estimates.
Plausibility
Some estimates looked plausible. In some cases, however, the number of guests or nights in the output
was negative. The figures below show a strong drop from March 2020 onward, corresponding
to fewer tourists due to COVID-19. For Amsterdam, the estimates look plausible; for Den Haag
(The Hague), the values become negative.
The figures below present data for 2018, 2019 and the first two months of 2020 from the tourism
survey. Data for March to May 2020 are estimated by our model.
Figure 73. Nights spent at the accommodation establishments in Amsterdam in 2018, 2019 and the
first two months of 2020 (data from the survey on tourist accommodation establishments)
Figure 74. Nights spent at the accommodation establishments in Den Haag in 2018, 2019 and the first two months of 2020 (data from the survey on tourist accommodation establishments)
A possible explanation for these negative values could be that the model does not work very well when
the data are very far away from the training data. In Figure 75, the blue dots are the training data
and the orange dots are the forecasts. For March, the mean price was around 400, close to the range
of prices in the training data. For April and May, the prices were extremely low, around 100 and 200.
One could argue that the model is not valid for this price range that is far away from the training data.
The forecasts could improve once survey data for March and April is available (that is, once we have
blue dots in the 100 – 200 price range).
Figure 75. Mean price and nights spent at the tourist accommodation establishments in Den Haag [blue dots - training data, orange dots – forecasts]
Accuracy
To test the accuracy of the flash estimates, models were estimated without using all available
months. For example, data for July 2019 to January 2020 were used to train the model, which was then
used to estimate February 2020. The estimates were then compared with the values from our survey for
February 2020, see Figure 76.
Figure 76. Comparison between estimates of nights spent at the tourist accommodation establishments and values from the survey
The exercise was repeated several months later, so that data for the months during which the
pandemic was ongoing were available. Please note that the survey data for these months are less
accurate than usual, due to measurement difficulties.
Figure 77 shows both the flash estimates based on web scraped data and the survey results. The
estimate for a given month is based on web scraping data and on survey data up to the previous
month, or up to two months earlier.
The flash estimates fail to predict the drop in tourists in April; they actually seem to lag behind
by a month. Only after the survey data show much lower values do the estimates start to decrease.
Clearly, the model does not handle this extreme situation well.
One could say that the model is still being trained, in this case for a regime with very low numbers
of tourists. After learning this, the model should be able to make more accurate estimates in the
event of a new, similarly extreme situation.
[Line chart: nights spent in Amsterdam, monthly, 2018 – April 2020; series: survey values and forecasts based on data up to November, December, January and February.]
Figure 77. Flash estimates based on web scraped data and the survey results
Technical problems encountered
While testing the R script, a small technical issue occurred: the script searches for a region in the
names of the web scraping files and also accepts partial matches. This created an error because both
the files 'Amsterdam South, Amsterdam, Nederland' and 'Amsterdam, Nederland' were matched to the text
"Amsterdam, Nederland". For now, one of the files was removed to solve this. A change has to be
made to the script for the process to become fully automated.
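The partial-match problem can be avoided by matching the region against the whole filename rather than a substring. A sketch with hypothetical filenames (the actual script is in R, where the same idea applies to its pattern matching):

```python
import re

files = ["Amsterdam South, Amsterdam, Nederland.csv", "Amsterdam, Nederland.csv"]
region = "Amsterdam, Nederland"

# Substring search matches both files (the bug described above):
loose = [f for f in files if region in f]

# An anchored, full match keeps only the exact region file:
strict = [f for f in files if re.fullmatch(re.escape(region) + r"\.csv", f)]
print(loose, strict)
```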
For two smaller cities, the script failed while running the 'lm' command, with the error message:
"Error in terms.formula(formula, data = data) : duplicated name 'X55.10000_offers' in data frame
using '.'". The reason why the script fails is unknown; the input data look similar to those of
other cities.
In the web scraped data, the filenames contained the code '%20', e.g. 'Monthly statistics -
Groningen,%20Nederland', while the other input files do not. The '%20' was manually replaced with a
space in the filenames. It would be convenient if the script that produces the monthly web scraping
statistics removed the '%20' automatically.
In our opinion, this script can be run by anyone with basic R skills. Some documentation or a short
manual could help someone who is using the script for the first time. It is assumed that all required R-
packages and updates are available in the National Statistical Offices.
[Line chart: nights spent in Amsterdam, monthly, 2018 – July 2020; series: survey values, one-month flash estimates and two-month flash estimates.]
Output summary:
Web scraped data from Hotels.com were used to create flash estimates on the number of tourists
staying at hotels. These estimates could be accurate under normal circumstances. The current
abnormal situation due to the COVID-19 pandemic, however, could not be accurately estimated by the
model.
Use Case Identification
Use Case ID WPJ.6.PL
Use Case Name Flash estimates
Date of creation 1.09.2020
Use Case Definition
Description: Flash estimates of the occupancy of accommodation establishments.
Preconditions: Basic knowledge of R-language programming, knowledge of methods of
estimating and verifying econometric forecasts.
Flow:
1. Preparation of monthly time series of data from the survey of the occupancy of
accommodation establishments.
2. Aggregation of daily data from web scraping of tourist portals to monthly statistics.
3. Summary of both sources by year and month.
4. Calculation of a data correlation matrix and on its basis narrowing the set of potential
explanatory variables.
5. The use of Hellwig’s information capacity index method to select the final set of explanatory
variables for the model.
6. Building an econometric model.
7. Calculation of forecasts.
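Hellwig's information capacity method (step 5) scores every subset of candidate regressors: for a subset S, each variable j contributes h_j = r0j² / (1 + Σ|rjk|), where r0j is its correlation with the dependent variable, rjk the correlations among the regressors in S, and the subset with the highest total capacity H = Σ h_j is selected. A compact sketch with illustrative correlations (the flow above is implemented in R):

```python
from itertools import combinations

def hellwig(r0, R):
    # r0[j]: correlation of regressor j with the dependent variable.
    # R[j][k]: correlation between regressors j and k.
    p = len(r0)
    best, best_h = None, -1.0
    for size in range(1, p + 1):
        for subset in combinations(range(p), size):
            h = sum(
                r0[j] ** 2 / (1 + sum(abs(R[j][k]) for k in subset if k != j))
                for j in subset
            )
            if h > best_h:
                best, best_h = subset, h
    return best, best_h

r0 = [0.8, 0.7, 0.2]                  # e.g. offers, price, a weak third variable
R = [[1.0, 0.9, 0.1],                 # offers and price strongly collinear
     [0.9, 1.0, 0.1],
     [0.1, 0.1, 1.0]]
subset, capacity = hellwig(r0, R)
print(subset, round(capacity, 3))
```

With these toy correlations the collinear pair is penalised and the single strongest regressor wins, which is the intended behaviour of the method.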
Issues/Exceptions:
In recent years, there has been growing demand from stakeholders for data on the rapidly changing
situation in the tourism market. The time that units have to report on the occupancy of tourist
accommodation establishments, and then the time needed by the official statistical units to process
the data received, mean that data on the occupancy of tourist accommodation establishments are often
more than one month late. The COVID-19 pandemic has forced many businesses to temporarily suspend or,
in extreme cases, close down their operations. As a result, the situation in the tourist services
market has changed. Such changes significantly reduce the quality of forecasts.
Output summary:
As a result of the work carried out, it was verified to what extent combining non-statistical
data (here from web scraping, available immediately after the end of the month) with official
accommodation data affects the speed and accuracy of predictions of tourism statistics.
Analyses have also been carried out on the accuracy of the forecasts in the face of the imbalance in
the tourism market caused by the pandemic. The results presented below concern five voivodships,
selected for analysis on the basis of several factors, such as attractiveness and size of
the tourist market, proximity of borders, and diversity of tourist attractions. The results refer to
the model estimated on data before the introduction of restrictions related to the coronavirus
pandemic and to the model based on data after the introduction of restrictions.
Table 71 and Table 72 present the quality indicators of the developed models.
Table 71. Model quality indicators before the imposition of restrictions related to COVID-19 (for data
from February 2019 to December 2019)
Voivodship      Model quality indicator    Number of tourists    Number of nights spent
Dolnośląskie    R-square                   0.7344                0.4497
                MAPE [%]                   5.6897                10.3175
Małopolskie     R-square                   0.6965                0.3796
                MAPE [%]                   7.5922                13.5518
Mazowieckie     R-square                   0.8234                0.7458
                MAPE [%]                   4.7434                5.1889
Podkarpackie    R-square                   0.6204                0.6539
                MAPE [%]                   15.9696               17.9956
Pomorskie       R-square                   0.5804                0.5211
                MAPE [%]                   22.5484               34.2698
Table 72. Model quality indicators after the imposition of restrictions related to COVID-19 (for data from February 2019 to March 2020)
Voivodship      Model quality indicator    Number of tourists    Number of nights spent
Dolnośląskie    R-square                   0.7499                0.5754
                MAPE [%]                   11.7492               14.5075
Małopolskie     R-square                   0.3838                0.1757
                MAPE [%]                   22.6783               27.7175
Mazowieckie     R-square                   0.6195                0.6225
                MAPE [%]                   125.6613              48.8748
Podkarpackie    R-square                   0.7033                0.6901
                MAPE [%]                   26.6235               31.3144
Pomorskie       R-square                   0.5787                0.5334
                MAPE [%]                   29.8027               41.1669
It can be seen that the quality of the models decreased in the latter case. This is due to very large
nominal drops in the examined values. For all studied voivodships, the MAPE (mean absolute percentage
error) is higher in the models estimated after the introduction of restrictions limiting, or even
prohibiting, the renting of hotel establishments.
The errors of the ex-post forecasts therefore increased, owing to the model's inability to reflect the declines caused by the closure of accommodation establishments at the end of the first quarter of 2020. This situation can be observed for the Podkarpackie Voivodship in Figure 78.
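MAPE, reported in the tables above, is simply the mean of the absolute percentage deviations between forecasts and actual values; a minimal sketch with illustrative numbers:

```python
def mape(actual, forecast):
    # Mean absolute percentage error, in percent.
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

print(round(mape([100.0, 200.0], [110.0, 180.0]), 1))   # 10.0
```

Because the actual values appear in the denominator, the very small occupancy numbers of the lockdown months inflate MAPE sharply, which helps explain the extreme values in Table 72.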
Figure 78. Number of nights spent in the period of January 2018 - April 2020 in the Podkarpackie
Voivodship
The observed declines significantly reduce the precision of forecasts. As a simplified picture of reality,
the model will only be able to reflect this state of affairs after some time. It should be remembered
that the proposed solution is based on statistics on the prices of accommodation establishments, and
even in the case of such large falls in nights spent, the prices of accommodation establishments have
not changed significantly. This may have been due to the fact that, in a situation of uncertainty as to
the duration of the restrictions, the owners could not or did not want to make decisions which would
have had a very significant impact on prices.
[Line chart: number of nights spent (actual) and forecast, monthly, January 2018 – April 2020.]
Use Case Identification
Use Case ID WPJ.6.SK
Use Case Name Flash estimates
Date of creation 03.10.2020
Use Case Definition
Description: Use of possible data sources to calculate flash estimates for the main
indicators from the survey on tourist accommodation establishments.
Preconditions: As long a time series of monthly web-scraped data as possible, monthly
data from the survey on tourist accommodation establishments and
possibly some other time series related to tourism.
Flow:
Within this use case, three data sources were analysed that could bring some predictive power for the
main tourism indicators – the number of tourists and the number of nights spent. First, data scraped
from the Internet within this project were examined, namely from the Hotels.com and Booking.com pages.
Second, Google Trends data were used, which measure weekly (and also daily) interest in search
terms and webpages.
Booking.com
Daily scraped data from July 2019 to April 2020 were obtained from Booking.com. After processing the
data, monthly statistics were calculated (the number of offers and price quantiles per accommodation
type), which could serve as good explanatory variables able to underpin our regression model.
However, the shortness of these time series was not the only obstacle to finding solid correlations
with the survey data. Another problem was the volume of the data, which caused very low variability
in mean (and also quantile) prices and thus blurred this important information. The number of offers
is obviously an important variable for such a large portal but, perhaps surprisingly, failed to
produce satisfying correlations with the survey data (given that Booking.com advertises many small
accommodations that can quickly sell out and thus substantially decrease the number of offers, a
strong correlation with the number of tourists was expected). One of the reasons may be the
coronavirus pandemic, which closed all accommodation establishments while many of them did not stop
advertising, so the relation between the number of offers and tourism performance was disturbed. As a
result, Booking.com data are not used for calculating flash estimates within this study.
The corrplots in the following figures represent the Pearson correlations between the survey
variables (tourists and nights spent) and the scraped variables (the others). The filled section of
each circle illustrates the exact value of the correlation, so that the reader can distinguish
between values such as 0.6 and 0.7.
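The correlation matrices in the figures are ordinary Pearson correlations between the monthly survey series and the scraped series; with numpy, for example (toy values chosen so that offers move opposite to tourists and prices move with them):

```python
import numpy as np

tourists = np.array([10.0, 12.0, 15.0, 9.0, 11.0])
offers   = np.array([30.0, 28.0, 25.0, 32.0, 29.0])   # moves opposite to tourists
price    = np.array([80.0, 85.0, 95.0, 78.0, 83.0])   # moves with tourists

# Pairwise Pearson correlation matrix of the three monthly series.
corr = np.corrcoef([tourists, offers, price])
print(corr.round(2))
```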
Figure 79. The matrix representing correlations between variables from Booking.com and survey on tourist accommodation establishments.
Hotels.com
Our time series of Hotels.com data run from April 2019 to August 2020 and are quite promising for
finding a sound regression model for calculating predictions. In order to enrich the set of possible
regressors, Google Trends data were added to the model, to see whether they provide stronger
predictive power than the Hotels.com data. The price variables bring vital information to the possible
models, whereas the usefulness of the number of offers strongly depends on whether the spring 2020
pandemic lockdown is taken into account or not: the correlations between the number of offers and
the survey data are very poor when the whole period from April 2019 to July 2020 is considered. With
the exclusion of April and May 2020 the correlation levels for the number of offers increase.
From the figures below one can see that several variables from Hotels.com correlate with the survey
indicators. In addition, the Google Trends data are quite consistent with the survey. Based on these
values, 8 variables were chosen for further processing, namely the number of offers (3 NACE
categories, as an experiment), lower-quantile prices (Q25) for NACE 55.1 and for all establishments,
median prices (Q50) for NACE 55.1 and for all establishments, and the Google Trends series. The first
matrix represents correlations for the whole period; the second stands for the period without April
and May 2020.
Figure 80. The matrix representing correlations between variables from Hotels.com (including Google Trends data) and survey on tourist accommodation establishments for April 2019 to July 2020
Figure 81. The matrix representing correlations between variables from Hotels.com and survey on tourist accommodation establishments excluding April and May 2020
Figures 82 and 83 depict the monthly time series of the number of tourists from the survey together
with the number of offers and the mean price from the Hotels.com data, respectively. The former
reveals a negative dependence between the time series (with the exception of the lockdown), the
latter shows a positive relation.
Figure 82. Monthly time series of the number of tourists from the survey and the number of offers from Hotels.com
Figure 83. Monthly time series of the number of tourists from the survey and the mean price from Hotels.com
The PL team prepared an R code for the calculation of flash estimates, but SK developed its own,
since a significant level of customisation was needed. Two approaches were used: a linear model and
ridge regression. Strong multicollinearity of the independent variables was encountered here, and
therefore not all of them can be used in the model simultaneously. For the linear model approach, our
code first calculates all possible models over the potential independent variables (255 models in
1.07 seconds) and chooses the one with the most significant predictors. Usually a single variable
gave the best model, and it was never a Google Trends series; Hotels.com price data tended to be the
better predictors within this short regression period (April 2019 – July 2020). Despite high
R-squared values (0.83 – 0.96), the plots below hint that these predictions are quite different from
the actual indicators. In order to find the best possible model, three regression periods were chosen
and benchmarked against the actual values. Presumably, the longer the period, the better the model,
as it learns from more information; however, this is not always the case.
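The exhaustive search described above (255 = 2⁸ − 1 candidate models over the 8 regressors) amounts to enumerating all non-empty variable subsets, fitting OLS on each and keeping the best by some criterion; a small sketch with 4 synthetic candidates and adjusted R² as the criterion (the SK code is in R and selects by predictor significance):

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)
n, p = 16, 4                     # 4 candidates -> 2^4 - 1 = 15 models here
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=n)   # only variable 0 matters

def adj_r2(X_s, y):
    # OLS fit with intercept, returning adjusted R-squared.
    n, k = X_s.shape
    Xd = np.column_stack([np.ones(n), X_s])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

best = max(
    (s for size in range(1, p + 1) for s in combinations(range(p), size)),
    key=lambda s: adj_r2(X[:, list(s)], y),
)
print(best)
```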
1. April 2019 – December 2019
This very short period (only 9 data points) provided some basic information to the model, but it is
obvious that, although the general trend was captured, the predicted values for the number of
tourists are overestimated (with the exception of February 2020). This is understandable, as the
model lacked any data on the pandemic – an unknown scenario until March 2020. If one wanted to use a
model based on Hotels.com data for the upcoming period, some "pandemic" data points would definitely
need to be incorporated.
Figure 84. Predictions for the number of tourists at the tourist accommodation establishments based on the model built on period between April 2019 and December 2019
On the other hand, the predictions for the number of nights spent are quite accurate, which is
surprising, as tourists and nights spent are almost perfectly correlated. The reason is that during
the pandemic freeze the average number of nights spent per tourist was considerably higher than
usual, so the trough for nights spent is not as deep.
Figure 85. Predictions for number of nights spent at the tourist accommodation establishments based on the model built on period between April 2019 and December 2019
2. April 2019 – March 2020
With three more data points the model matured a little, but it still overestimated the number of
tourists during the lockdown period, and also July, by a non-negligible amount. The prediction for
August is even higher, which is in line with general expectations.
Figure 86. Predictions for number of tourists at the tourist accommodation establishments based on the model built on period prior to the pandemic lockdown
Figure 87. Predictions for number of nights spent at the tourist accommodation establishments based on the model built on period prior to the pandemic lockdown
3. April 2019 – May 2020
Adding the critical April value did not help the model. To use this method in statistical
production, one would need a longer time series of monthly scraped data and, ideally, a return to a
stable situation in tourism.
Figure 88. Predictions for number of tourists at the tourist accommodation establishments based on the model built on period after the pandemic lockdown
Figure 89. Predictions for number of nights spent at the tourist accommodation establishments based on the model built on period after the pandemic lockdown
In ridge regression, the multicollinearity of predictors is treated by introducing a penalty term,
which shrinks the coefficients of most of the variables in favour of the most significant ones.
All 8 independent variables were used in the model. The results were, however, not acceptable, as
implausible patterns appeared in the estimated time series (false trends, reversed or even negative
values). Figure 90 illustrates predictions based on the April 2019 – March 2020 values.
Figure 90. Predictions for number of tourists at the tourist accommodation establishments based on the ridge regression model
Google Trends
Following the advice of the DE partner (Hesse), SK also tried to make use of Google Trends data, as,
in contrast to scraped data, they are available for a substantially longer time period and bear a
direct relation to the tourism indicators, even in pandemic or other unpredictable times, thus
constituting a powerful source of explanatory data. Time series are available on the Google Trends
webpage that measure user interest in search terms and webpages per country and region. Data for the
Booking.com and Megaubytovanie.sk pages were downloaded at weekly frequency since 2016 and summed
together. These two reservation portals are the most popular in Slovakia, Booking.com as a global
leader and Megaubytovanie.sk as a leader offering only domestic accommodation (booking.com and
megaubytovanie.sk were used as search criteria and their trends added together; the keyword
ubytovanie could also be used as an alternative criterion, as it exhibits sound seasonality
patterns). Other possible sources (Hotels.com, Trivago.com, Airbnb.com, etc.) did not exhibit the
desired seasonal patterns and were thus not taken into account. Although a search for Booking.com
from Slovakia does not necessarily indicate interest in domestic tourism, some correlations are
present. Moreover, Google Trends data record continuous interest and may precede the actual use of
an accommodation (which is what the survey data report). Our analysis indicated that they should be
shifted two weeks ahead in order to match our survey results. Within this study, the shifted data
were used.
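The two-week alignment can be implemented as a simple shift of the weekly index; an illustrative pandas sketch with toy values:

```python
import pandas as pd

weeks = pd.date_range("2020-01-05", periods=6, freq="W")
trends = pd.Series([10, 20, 30, 40, 50, 60], index=weeks)

# Shift the search-interest series two weeks ahead, so that interest
# observed in week t is aligned with accommodation use in week t+2.
shifted = trends.shift(2, freq="W")
print(shifted.index[0].date())   # 2020-01-19
```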
Figure 91 illustrates relation of these time series.
Figure 91. Shifted weekly data of Google Trends time series in relation to the official data
To build a model, the weekly Google Trends data had to be converted into monthly time
series. Figure 92 reproduces the previous figure, but with aggregated Google Trends data.
Figure 92. Shifted monthly data of Google Trends time series in relation to the official data
Regarding the correlations between the survey indicators and the trends data, they are all positive
and sufficiently high.
Figure 93. Correlations between the survey indicators and Google Trends data
The scatterplot below (and also the time series plot above) suggests a quadratic relationship
between the survey data and the trends data, so one might assume that the best model would be a
linear model with some transformation of the trends data.
Figure 94. Scatterplot of official and Google Trends data
To find the most feasible model, 8 options with 2 different prediction periods were tested. Let us
denote survey data as 𝑦 and Google Trends data as 𝑥. Models considered were:
1. 𝑦 = 𝛽0 + 𝛽1𝑥 + 𝛽2𝑥² + 𝛽3𝑥³ + 𝜀
2. 𝑦 = 𝛽0 + 𝛽1𝑥 + 𝛽2𝑥² + 𝜀
3. 𝑦 = 𝛽0 + 𝛽1𝑥 + 𝜀
4. 𝑦 = 𝛽0 + 𝛽1 ln 𝑥 + 𝜀
5. 𝑦 = 𝛽0 + 𝛽1𝑥^(1/2) + 𝜀
6. 𝑦 = 𝛽0 + 𝛽1𝑥^(1/2) + 𝛽2𝑥^(1/3) + 𝜀
7. 𝑦 = 𝛽0 + 𝛽1𝑥^(1/2) + 𝛽2𝑥^(1/3) + 𝛽3𝑥^(1/4) + 𝜀
8. 𝑦 = 𝛽0 + 𝛽1𝑥^(1/2) + 𝛽2𝑥^(1/3) + 𝛽3𝑥^(1/4) + 𝛽4𝑥^(1/5) + 𝜀
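Each of these models is linear in its transformed regressors, so all eight can be fitted with ordinary least squares after building the appropriate design matrix. A sketch for models 2, 3 and 5 with illustrative data (the actual analysis is in R):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(1.0, 100.0, 30)                                   # trends index
y = 5.0 + 0.4 * x + 0.02 * x**2 + rng.normal(scale=2.0, size=30)  # quadratic truth

# Design matrices for candidate models 2, 3 and 5.
designs = {
    2: np.column_stack([np.ones_like(x), x, x**2]),     # b0 + b1 x + b2 x^2
    3: np.column_stack([np.ones_like(x), x]),           # b0 + b1 x
    5: np.column_stack([np.ones_like(x), np.sqrt(x)]),  # b0 + b1 x^(1/2)
}

r2 = {}
for k, X in designs.items():
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2[k] = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

print(max(r2, key=r2.get))   # the quadratic model fits the quadratic data best
```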
Prediction period 2016 – 2018
First, SK tried to build the model on relatively stable data and checked whether it could provide
enough information to calculate feasible predictions for 2019 (when there was a big tourism peak
during the summer season) and for the pandemic lockdown. In order to select an appropriate model,
simulations were run in which, each time, some of the data points were randomly chosen and used to
estimate the models. The mean squared errors give some guidance as to which option is the most
acceptable in terms of bias and variance. The following are the results for the number of nights
spent.
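The simulation procedure described above, repeatedly estimating a candidate model on a random subset of data points and scoring it on the held-out points, is essentially repeated random sub-sampling validation. A minimal sketch for one candidate model, with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(1.0, 50.0, 40)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=40)

def holdout_mse(design, y, n_sims=200, train_frac=0.7):
    # Average held-out MSE over repeated random train/test splits.
    n = len(y)
    mses = []
    for _ in range(n_sims):
        idx = rng.permutation(n)
        tr, te = idx[: int(train_frac * n)], idx[int(train_frac * n):]
        beta, *_ = np.linalg.lstsq(design[tr], y[tr], rcond=None)
        resid = y[te] - design[te] @ beta
        mses.append(np.mean(resid ** 2))
    return float(np.mean(mses))

linear = np.column_stack([np.ones_like(x), x])
mse = holdout_mse(linear, y)
print(round(mse, 2))   # close to the noise variance (about 1)
```

Comparing this held-out MSE across the eight candidate models would reproduce the kind of ranking shown in Figure 96.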
Figure 95. Simulations of tested models for the number of nights spent based on Google Trends data
Figure 96. Mean squared errors of the tested models
Based on these results, options 2, 3 and 6 were chosen for further processing. Model 2 performed best for the number of tourists, with adjusted R-squared of 0.55 and MAPE of 11.6%. As for the number of nights spent, model 2 was able to reproduce the 2019 summer peak, whereas model 3 was more powerful for the pandemic lockdown (but still very inaccurate). Model 2 was judged the more acceptable, with adjusted R-squared of 0.62 and MAPE of 14%.
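The subsampling simulation described above can be sketched as repeated random train/test splits with the mean squared error averaged over replications. The transformation sets stand in for models 2, 3 and 6; the data and the 70% training fraction are assumptions for illustration.

```python
# Sketch of the model-selection simulation: repeatedly fit each candidate on a
# random subset of months, predict the held-out months, and average the MSE.
# Data, split fraction and replication count are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def design(x, powers):
    return np.column_stack([np.ones_like(x)] + [x ** p for p in powers])

def simulate_mse(x, y, powers, n_sim=200, train_frac=0.7):
    n, errs = len(x), []
    for _ in range(n_sim):
        idx = rng.permutation(n)
        cut = int(train_frac * n)
        tr, te = idx[:cut], idx[cut:]
        beta, *_ = np.linalg.lstsq(design(x[tr], powers), y[tr], rcond=None)
        errs.append(np.mean((y[te] - design(x[te], powers) @ beta) ** 2))
    return float(np.mean(errs))

x = np.linspace(20, 100, 36)                            # 36 toy "months"
y = 3 * x + 0.05 * x ** 2 + rng.normal(0, 15, size=36)  # toy indicator
mse = {name: simulate_mse(x, y, p)
       for name, p in {"model 2": [1, 2],
                       "model 3": [1],
                       "model 6": [0.5, 1/3]}.items()}
```

Averaging prediction error over many random splits, rather than a single in-sample fit, is what lets the comparison reflect both bias and variance, as noted above.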
Figure 97. Predictions for number of tourists at the tourist accommodation establishments based on tested model 2 with prediction period 2016 - 2018
Figure 98. Predictions for number of nights spent at the tourist accommodation establishments based on tested model 2 with prediction period 2016 - 2018
Prediction period 2017 – 2019
Secondly, SK tried to enhance the model precision by omitting the 2016 data points, as there is some discrepancy between the tourism performance and the volume of Internet searches in that year, and by adding some crucial points, namely July 2019, winter 2019, and March and April 2020. With this information there are enough data for estimating peaks and troughs, so it can be expected that even outliers like August 2019 and May 2020 could be predicted with sufficient accuracy. The results are below; the April 2020 data point caused dramatic behaviour in some of the models.
Figure 99. Simulations of tested models for the number of nights spent based on Google Trends data
Figure 100. Mean squared errors of the tested models
Unsurprisingly, the mean squared errors for models 7 and 8 are very high, due to their extreme variance. Models 2 – 5 were chosen for further processing. Model 5 seemed to make the most feasible predictions for the number of tourists, with adjusted R-squared of 0.67 and MAPE of 54% (so high because of the April 2020 data point). For the number of nights spent, the basic linear model (number 3) yielded the best predictions in terms of peaks and troughs (though all of them were quite similar), with adjusted R-squared of 0.78 and MAPE of 36%.
Figure 101. Predictions for number of tourists at the tourist accommodation establishments based on tested model 5 with prediction period 2017 - 2019
Figure 102. Predictions for number of nights spent at the tourist accommodation establishments based on tested model 3 with prediction period 2017 - 2019
The models vary and it is not straightforward to single out one of them as the best. The coronavirus pandemic affected the whole of tourism, and future developments are not easy to forecast. Setting up a system of flash estimates would require longer time series with more information on specific scenarios. Moreover, the preferred model can change from one month to another and can be variable-specific. At this moment it can be claimed that Google Trends data carry some explanatory power, but the results are not superior.
Issues/Exceptions:
Setting up a system of flash estimates would require regular and continuous time series of auxiliary data, either scraped or obtained from external data sources (such as Google Trends). Any interruption or break in the data supply could cause the production of flash estimates to fail.
Output summary:
Of the 3 tested data sources that could serve as instruments for modelling flash estimates, Hotels.com and Google Trends data exhibited some potential and explanatory power. It is hard to say whether the model based on Google Trends data outperformed the one based on Hotels.com data; both showed pros and cons:
Model based on Hotels.com data
Pros:
– rather stable data source with a direct connection to domestic tourism
– price variables strongly correlate with the tourism indicators
– provided sound predictions even for the pandemic lockdown (but only for the number of nights spent, which could be due to a discrepancy between survey indicators)
Cons:
– extensive daily scraping is necessary, which can crash at any time
– not especially robust in unexpected scenarios (accommodation owners advertising even when restrictions apply, price distributions with low variance)
– still too short a time series
Model based on Google Trends data
Pros:
– easily accessible real-time data on users' search activity; long time series available
– significant correlation with tourism indicators
– provided promising predictions for peaks and troughs
– robust in unexpected scenarios, as user behaviour on the Internet reflects the current circumstances
Cons:
– only an indirect connection to domestic tourism, as it is mixed with foreign tourism
– impossible to calculate regional estimates
Both approaches can be developed further. In unusual situations in the field of tourism, such as the one encountered in 2020, they can provide vital auxiliary information for the models. In the case of stable development, on the other hand, time series forecasting without external explanatory variables could yield more feasible flash estimates.
4.7. Methodology to improve the quality of data in various statistical areas
Use Case Identification
Use Case ID WPJ.7.PL
Use Case Name Methodology to improve the quality of data in various statistical areas
Date of creation 01.09.2020
Use Case Definition
Description: Estimation of the number of trips using data from a sample survey of
trips, administrative sources and big data. Estimation of tourism
expenditure with the use of data obtained in the process of web scraping
of portals related to the tourism industry.
Preconditions: Knowledge of the basic methods of machine learning and their
implementation in the R programme.
Flow:
1. Collect all origin and destination airports with the use of an online flight connection search
engine.
2. Use flight schedules to derive a distribution of flights for each origin airport.
3. Attach the country where each airport is located using airport code lists (IATA, ICAO, FAA).
4. Attach the aircraft capacity (seats) using technical data on aircraft.
5. Calculate the distribution of flights (measured with seats) for each airport available from
national airports. The destination airports identified in this step will be called hub airports.
6. Calculate the distribution of flights for each hub airport. Repeat this step until all destination
airports are reached.
7. Remove all routes that are irrelevant with respect to time or cost efficiency.
8. Use data from the civil aviation office to benchmark the distribution of flights from a given origin airport to the known total.
9. For each hub airport, calculate the number of passengers that travel further, using a relevant statistic from the sample survey (the share of tourists using airports from a given country as a hub).
10. Sum up passengers from all routes: travelling directly from the origin country, travelling via one hub airport, etc.
11. Using web scraping, collect data on accommodation establishment prices, air ticket prices and average meal prices.
12. Aggregate the abovementioned data into statistics (average, median, quartiles).
13. Combine the aggregated data with the micro-database of the sample survey of trips according to keys, e.g. country, type of accommodation establishment and quarter.
14. Construct the average expenditure model using machine learning methods.
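Steps 8 – 10 of the flow above can be sketched as follows. This is a simplified illustration under stated assumptions: a single origin airport, one layer of hubs, and a single survey-based hub-transfer share (`hub_share`); all numbers and airport names are hypothetical.

```python
# Illustrative sketch of flow steps 8-10: benchmark the seat distribution from
# an origin airport to the known civil-aviation total, then add passengers
# routed onward through hub airports using the survey-based transfer share.
# All figures and the `hub_share` parameter are hypothetical.

def estimate_trips(direct_seats, known_total, hub_traffic, hub_share):
    """direct_seats: seats by destination country from the origin airport;
    known_total: civil aviation office total of departing passengers;
    hub_traffic: {hub: {onward country: passengers}} routed via each hub;
    hub_share: survey share of passengers travelling onward from a hub."""
    scale = known_total / sum(direct_seats.values())   # benchmarking factor
    trips = {c: s * scale for c, s in direct_seats.items()}
    for hub_routes in hub_traffic.values():            # add one-hub routes
        for country, pax in hub_routes.items():
            trips[country] = trips.get(country, 0.0) + pax * hub_share
    return trips

direct = {"Brazil": 500, "Argentina": 200}             # seats, hypothetical
hubs = {"FRA": {"Peru": 300, "Chile": 150}}            # onward pax via hub
trips = estimate_trips(direct, known_total=10_000,
                       hub_traffic=hubs, hub_share=0.4)
```

In the full flow the hub step is repeated until all destination airports are reached (step 6) and cost- or time-inefficient routes are removed first (step 7).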
Issues/Exceptions:
In the survey of trips of Poles, about 80 countries appear each year, whereas estimating trips using big data yields over 140 destinations. Due to the very large number of countries for which it would be necessary to collect data through web scraping, the project work was limited to South America, with 13 countries only. Web scraping started in 2019 and continued in 2020. The crisis related to COVID-19 caused huge disturbances in data from accommodation portals, such as a drop in the number of offers in certain market segments, as well as the suspension of international flights and, as a result, a lack of data on flight ticket prices. For this reason, the description of the results is limited to 2019. It should be noted that in order to generate reliable statistics using data from tourist portals, web scraping should be carried out over a sufficiently long period.
Output summary:
All the analysis results presented in this section refer to trips and expenditure of Polish residents
travelling to South American countries.
Estimation of trips in air traffic
In 2019, the number of South American countries that appeared in the survey of the participation of
Polish residents in trips was between 4 and 6, depending on the quarter. In the same period, the
number of countries generated on the basis of big data was between 8 and 10. Finally, the number of
countries with estimated trips in 2019, using the James-Stein estimator, was between 9 and 10
(see Table 73).
Table 73. Number of countries with estimated trips to South America by source
Source of data                                                         Q1 2019  Q2 2019  Q3 2019  Q4 2019
Survey of participation of Polish residents in trips                      6        4        6        6
Big data                                                                 10        8        8        9
Survey and big data combined with the James-Stein estimator              10        9       10        9
Figure 103 shows which countries were obtained in each source as illustrated by the case of Q3 2019.
Three countries occurred simultaneously in both sources, five only in the big data collection, and two
only in the survey of participation of Polish residents in trips. Three countries did not appear in either
of these sources.
Figure 103. Countries with estimated number of trips by data source (Q3 2019)
When estimating the number of trips to countries that appear in both sources (the sample survey of trips and big data), the value of the James-Stein estimator is closer, in terms of mean squared error, to the results obtained from the survey of trips than to the results from big data. On the other hand, when an estimate of the number of trips is based on big data alone, the value of the James-Stein estimator is non-zero, but not close to the results obtained from big data (see the example of Argentina in Figure 104).
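A James-Stein-type combination of the two sources can be sketched as shrinkage of the big data counts toward the survey counts. The exact estimator used in the project is not fully specified in this section, so the version below (positive-part shrinkage with an assumed known noise variance) is an illustration, not the project's implementation; the counts are hypothetical.

```python
# Hedged sketch of a James-Stein-type combination of the two sources:
# shrink the big data trip counts toward the survey counts, with the
# shrinkage factor driven by an assumed noise variance sigma2.
# This is NOT the project's exact estimator; counts are hypothetical.
import numpy as np

def james_stein_combine(big_data, survey, sigma2):
    """Positive-part James-Stein shrinkage of big_data toward survey."""
    big_data = np.asarray(big_data, dtype=float)
    survey = np.asarray(survey, dtype=float)
    p = len(big_data)                                   # needs p > 2
    diff2 = np.sum((big_data - survey) ** 2)
    shrink = max(0.0, 1.0 - (p - 2) * sigma2 / diff2)   # positive part
    return survey + shrink * (big_data - survey)

big = [1200, 300, 80, 950]    # trips from big data (hypothetical)
srv = [1000, 250, 0, 900]     # trips from survey (0 = country not observed)
combined = james_stein_combine(big, srv, sigma2=5000.0)
```

Note the behaviour described above: for the country observed only in big data (survey value 0), the combined estimate is non-zero but pulled away from the raw big data figure.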
Figure 104. Distribution of trips to South America in third quarter of 2019
Analysis of big data sources on tourism expenditure
The distribution of tourism expenditure reported by respondents in the sample survey and the distribution of prices for overnight stays in big data sources for South American countries were compared by quarter. This analysis makes it possible to assess the extent to which price statistics for overnight stays can be selected from booking portals that behave similarly to the average expenditure for overnight stays in the survey of trips for South American countries.
The variation of the variables over time is shown in the box-and-whisker plot, where a thick line marks the median expenditure, the box covers the first and third quartiles, and the whiskers reflect the extremes (determined according to McGill R., Tukey J. W., Larsen W. A. (1978)), while the circles show outliers. The red line indicates the average price.
Figure 105. Distribution of average overnight expenditure from the survey of trips for South American countries (non-generalised data) [EUR]
In the analysed period, the coefficient of variation of expenditure in the individual data in subsequent
quarters ranged from 0.40 to 1.39, with the average of 1.01. The average expenditure for overnight
stays in the survey of trips changed dynamically and irregularly. This result suggests that it is likely to
be difficult to select price statistics for overnight stays from booking portals, which have a similar
variability as the average expenditure for overnight stays in the survey.
Similarly, the data from web scraping are very variable, as can be seen in Figure 106. Due to the significant number of outliers, the analysis was limited to offers whose price for an overnight stay did not exceed EUR 700. Thus, from the original collection of 1.3 million offers, nearly 4 thousand offers that did not meet the assumed price criterion were rejected. Before their rejection, the coefficient of variation ranged, depending on the quarter, from 1.65 to 4.55. After excluding offers with a price over EUR 700, the coefficient of variation averaged 1.08, which is close to the level determined on the basis of data from the sample survey.
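The trimming and coefficient-of-variation computation can be sketched as below. The price distribution is simulated for illustration; only the EUR 700 threshold comes from the text above.

```python
# Sketch: drop offers above EUR 700 and compare the coefficient of variation
# (std / mean) before and after trimming. The price data are simulated;
# only the EUR 700 cut-off comes from the analysis described above.
import numpy as np

def coef_of_variation(prices):
    prices = np.asarray(prices, dtype=float)
    return float(prices.std(ddof=1) / prices.mean())

rng = np.random.default_rng(1)
prices = np.concatenate([
    rng.lognormal(mean=3.5, sigma=0.6, size=5000),  # bulk of the offers
    rng.uniform(700, 5000, size=15),                # rare luxury outliers
])
trimmed = prices[prices <= 700]                     # apply the EUR 700 criterion
cv_raw = coef_of_variation(prices)
cv_trimmed = coef_of_variation(trimmed)             # markedly lower
```

A handful of extreme offers is enough to inflate the raw coefficient of variation, which is why the trimmed figure comes much closer to the survey-based level.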
Figure 106. Distribution of prices for overnight stay from Hotels.com for South American countries [EUR]
It turns out that there is a high volatility of accommodation prices, but at the same time price statistics
such as the average or median are very stable over time (red line for the average and thick black line
for the median). This may suggest that the volatility of average expenditure for overnight stays
observed in the survey of trips is partly coincidental.
Different statistics on the price of overnight stays were analysed in order to select the one that best describes the average price for an overnight stay in the sample survey for the South American countries; the selected statistic is then used as an explanatory variable in the expenditure estimation.
Table 74. Evaluation of the relationship between the average price per overnight stay in the survey of trips and the selected price statistics
Statistic                         Average price   First quartile of prices   Median of prices   Third quartile of prices
MAPE [%]                               40                   32                     20                    68
Linear correlation coefficient        0.396                0.520                  0.504                 0.49
The lowest value of MAPE was calculated for the median prices. It also showed a high correlation with
the average expenditure per night obtained from the survey of trips. Figure 107 shows both statistics
by country.
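The scoring of the candidate price statistics in Table 74 can be sketched as a MAPE and correlation computation against the survey average. The figures below are illustrative stand-ins, not the project data.

```python
# Sketch: score candidate price statistics (e.g. mean, median of portal
# prices) against the survey's average overnight expenditure using MAPE and
# the linear correlation coefficient. Data are illustrative placeholders.
import numpy as np

def mape(actual, predicted):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 * float(np.mean(np.abs((actual - predicted) / actual)))

survey_avg = np.array([35.0, 42.0, 28.0, 50.0, 31.0])   # per country, toy data
candidates = {
    "mean":   np.array([55.0, 60.0, 40.0, 70.0, 45.0]),
    "median": np.array([33.0, 45.0, 26.0, 47.0, 30.0]),
}
scores = {name: (mape(survey_avg, stat),
                 float(np.corrcoef(survey_avg, stat)[0, 1]))
          for name, stat in candidates.items()}
```

With this kind of scoring, a statistic can correlate well with the survey average yet still be rejected on MAPE, which is why both criteria appear in Table 74.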
Figure 107. Median prices for overnight stays in the accommodation portal and the average price for overnight stays in the survey of trips
The analyses show that the classic measure of central tendency, the average, as well as positional measures such as the third quartile, poorly reflect the average expenditure on accommodation in the sample survey. The stability over time of the median accommodation price and, at the same time, the fairly high similarity of its trend to the trend of average expenditure in the survey of trips suggest that its use may improve the stability of the results of that survey.
In order to estimate residents' tourism expenditure, data on airline ticket prices from kayak.com and
meal prices available on tripadvisor.com were also collected. However, due to the short time of data
collection and the small volume of the collection, such analyses as for overnight stays were not
conducted.
Estimation of travel expenses based on big data
Two groups of variables were used to build the model for estimating tourism expenditure:
– sample survey: the number of tourists and overnight stays and a binary variable specifying the country of destination of the trip,
– big data: median accommodation prices for NACE 55.1 and 55.2 establishments, median flight ticket prices, average meal prices.
ANCOVA analysis was used to determine the extent to which the selected variables explain the variability of tourism expenditure to the South American countries. The ANCOVA
analysis showed that almost a third of the expenditure volatility can be explained by the explanatory
variables.
Figure 108. ANCOVA for tourism expenditure of tourists visiting South America in 2019 (residual variance 68.7%; tourists and nights spent 25.4%; big data: flight tickets 2.7%, accommodation 1.5%, restaurants and cafes 1.0%; country specificity 0.8%; explained total 31.3%)
The big data sources explain a total of 5.1% of the total expenditure volatility, which is 16.3% of the explained volatility. The specificity of countries, which cannot be expressed by the variables from the big data collections, represents only 0.8% of the total expenditure volatility. This means that the variables coming from the web scraping of tourist portals in fact almost fully explain the variability of expenditure at the country level. The ANCOVA analysis indirectly indicates that a linear model will explain at most 31% of the expenditure variability. Therefore, without more complex methods it will not be possible to explain the variability of expenditure better at the level of individual data. For this reason, the best-performing methods as measured by the root mean square error (RMSE) and the coefficient of determination (R-squared), i.e. the regression tree (Breiman L., Friedman J. H., Olshen R. A., Stone C. J. (1984)) and the weighted method of k-nearest neighbours (Samworth R. J. (2012)), were used during the project implementation.
Table 75. Statistics on tourism expenditure models for tourists visiting South America
Name of method                                      RMSE     R-square
Linear model                                       138.49      0.293
Regression tree (cp=0.03)                          119.50      0.466
Weighted method of k-nearest neighbours (k=5)      119.24      0.490
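The distance-weighted k-nearest-neighbours regression used here can be sketched as follows. The implementation is hand-rolled for self-containment (the project cites Samworth (2012) for the method itself), and the data are simulated stand-ins for the trip-level features (nights, prices).

```python
# Sketch of distance-weighted kNN regression (k=5): each prediction is an
# inverse-distance-weighted average of the k nearest training responses.
# Hand-rolled for self-containment; features and responses are simulated.
import numpy as np

def knn_weighted_predict(X_train, y_train, X_test, k=5):
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)     # distances to train points
        idx = np.argsort(d)[:k]                     # k nearest neighbours
        w = 1.0 / (d[idx] + 1e-9)                   # inverse-distance weights
        preds.append(np.sum(w * y_train[idx]) / np.sum(w))
    return np.array(preds)

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(200, 3))    # e.g. nights, room price, meal price
y = 300 * X[:, 0] + 200 * X[:, 1] ** 2 + rng.normal(0, 20, size=200)
X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]
err = rmse(y_te, knn_weighted_predict(X_tr, y_tr, X_te, k=5))
```

Because kNN averages only local neighbours, it can pick up the nonlinear structure that the ANCOVA shows a linear model cannot, consistent with the RMSE gap in Table 75.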
Using the weighted k-nearest neighbours method with k=5, the total expenditure of Poles travelling to South American countries was estimated for the individual quarters of 2019. Expenditure is presented on a logarithmic scale, which shows the differences in expenditure between countries (see Figure 109). A result in the range of 6-7 means that expenditure was between 1 million and 10
million euros, while the result in the range of 5-6 means expenditure between 100 000 and 1 million
euros, and so on.
Figure 109. Expenditure of Poles on trips to South American countries in 2019 (logarithmic scale) [EUR]
Comparative analyses of trips with the same characteristics to different South American countries show that the model differentiated expenditure according to the variability of the big data inputs. Based on the model, expenditure per person for a trip with 16 overnight stays in a hotel in Brazil was estimated at EUR 1 445, while expenditure for a trip to Paraguay, under the same assumptions, was estimated at EUR 1 603. The model expenditure showed less volatility than the expenditure from the survey of trips, which improved the stability of the results for individual countries and quarters.
The final result was influenced both by the new distribution of trips with new countries and by the
estimation of expenses including big data sources. In order to determine how these two elements
affected the final result, tourism expenditure was estimated in the following options, taking into
account:
1. the average expenditure directly from the sample survey of trips and the distribution of trips
from the model,
2. the distribution of trips from the survey and the average expenditure from the model.
For 2019, total expenditure increased after modelling from EUR 22 335 000 by EUR 3 978 000 to
EUR 26 313 000. The amount of EUR 3 978 000 can be broken down into the effect of modelling the
expenditure itself, the change in the distribution of trips itself and the remaining effect of the total
change.
The following Sankey diagram (cf. Kennedy A.B.W., Sankey H.R. (1898)) shows the settlement of
changes in total expenditure by country and data sources.
Figure 110. Structure of expenditure on trips to South America in 2019 by country and data sources
Tourists from countries obtained from big data generated 1.5% of expenditure, and tourists from
countries that appeared in the survey in 2019, after modelling, generated 98.5% of expenditure. The
change in the distribution of trips alone generated 1% of the total expenditure, while modelling the
average expenditure alone generated 14% of the total expenditure.
Combining administrative data, survey data and big data pertaining to air traffic made it possible to estimate trips to a larger number of countries than results from the survey of the participation of Polish residents in trips alone.
The use of machine learning methods, taking into account web-scraped data on the prices of flight tickets, accommodation and meals, made it possible to estimate expenditure that is more stable for individual countries and periods (quarters) than that from the sample survey alone.
4.8. Experimental tourism statistics
Use Case Identification
Use Case ID WPJ.8.BG
Use Case Name Experimental tourism statistics
Date of creation 01.01.2019
Use Case Definition
Description: Applying the web scraping method to extract data from Booking.com and Pochivka.bg for statistical purposes. Booking.com is scraped daily and Pochivka.bg weekly.
Preconditions: An IT expert with web scraping experience and knowledge, an IT operator for execution of the software, and a dedicated IT environment. A subject-matter statistician is needed for processing and analysing the scraped data.
Flow:
1. Analysing the website structures.
2. Studying the possibility to use a free tool for web scraping and selecting the Webscraper.io
Chrome Extension as such.
3. Configuring the Webscraper.io Chrome Extension.
4. Scraping the data manually for seven months.
5. Extracting the configurations of Webscraper.io Chrome Extension for websites as JSON files.
6. Developing the Python Scrapy software with webscraper.io JSON configuration for scraping
the Booking.com and Pochivka.bg data.
7. Configuring the Python Scrapy software to be executed automatically on scheduled time.
8. Collection of scraped data automatically from Booking.com and Pochivka.bg.
9. Processing, editing and classifying scraped data.
10. Analysing data and producing experimental outputs for tourism statistics.
Issues/Exceptions:
• Only 80 result pages of Booking.com could be scraped by the Python Scrapy software.
• Changes in website structure.
Output summary:
CSV files with scraped accommodation data.
Estimates on scraped data:
Descriptive statistics by accommodation types.
Price index per week/month by NUTS 3
Correlation coefficient between rating and price
Average price of accommodations by months for Booking.com and Pochivka.bg
Experimental statistics on accommodations at Booking.com and Pochivka.bg
Aims
The Bulgarian National Statistical Institute (BNSI) is interested in producing experimental statistics by
applying the web scraping method to extract data from Booking.com and Pochivka.bg for statistical
purposes. On the basis of the collected scraped data and their processing, the following outputs were
produced:
Number and change of accommodations by type and months/weeks,
Number of accommodations by NUTS 3 and months/weeks,
Average price of accommodations by NUTS 3, by type and by months/weeks,
Price index of accommodations by NUTS 3, by type and by months/weeks,
Correlation coefficient between rating and price,
Average price of accommodations by months/weeks for Booking.com and Pochivka.bg.
Data Sources
BNSI used the following data sources to reach the main aims of this use case:
www.Booking.com
www.Pochivka.bg
Booking.com is scraped daily and Pochivka.bg is scraped weekly.
Methodology
For the collection of data from Booking.com, the BNSI IT experts developed a script based on the Webscraper.io Chrome Extension, which successfully extracted data daily from 26 April 2019 until 15 March 2020 for the purpose of the survey. Over 300 daily files, with some holiday gaps, were collected. Since October 2019, Python Scrapy software using the webscraper.io JSON configuration has been developed; 281 files from October 2019 until August 2020, with daily information on accommodations offered for two adults on the current day by Booking.com, were collected. The data collected by the script based on the Webscraper.io Chrome Extension are not of good quality and are treated as test data. For this reason it was decided (after working out the Python Scrapy software) to use the scraped data for October 2019 – August 2020 for the current use case and for calculating the above-mentioned indicators.
The daily collected information from Booking.com consists of the following characteristics:
pagination (page with offers),
priority (for sorting purposes),
element (site – www.booking.com),
hotel name,
location (for NUTS 3 classification),
price,
currency,
rating,
review_Count,
date_scraped.
BNSI collects weekly information on accommodations offered by www.pochivka.bg. The information consists of the offer URL and the following characteristics:
pagination,
priority,
page,
hotel_name,
rating,
location,
price_silen_leten, price_slab_leten, price_silen_zimen, price_slab_zimen, price_other_season,
currency,
address,
tip_nast,
date_scraped.
The detailed flow for the execution of this use case can be described in nine main steps:
1. Analysing the website structures.
2. Studying the possibility of using a free tool for web scraping and selecting the Webscraper.io
Chrome Extension as such.
3. Configuring the Webscraper.io Chrome Extension.
4. Scraping the data manually for seven months.
5. Extracting the configurations of Webscraper.io Chrome Extension for websites as JSON files.
6. Developing the Python Scrapy software with webscraper.io JSON configuration for scraping
the Booking.com and Pochivka.bg data.
7. Configuring the Python Scrapy software to be executed automatically on scheduled time.
8. Collection of scraped data automatically from Booking.com and Pochivka.bg.
9. Processing, editing and classifying scraped data.
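The extraction step can be sketched with the standard library alone. The project itself uses a Python Scrapy spider driven by the webscraper.io JSON configuration; the parser below is a simplified stand-in, and the CSS class names and sample HTML are hypothetical, not Booking.com's real markup.

```python
# Minimal stdlib sketch of the extraction step. The real project uses a
# Scrapy spider with the webscraper.io JSON configuration; the class names
# 'hotel-name' and 'hotel-price' and the HTML below are hypothetical.
from html.parser import HTMLParser

class OfferParser(HTMLParser):
    """Collect (hotel_name, price) pairs from elements tagged with the
    hypothetical classes 'hotel-name' and 'hotel-price'."""
    def __init__(self):
        super().__init__()
        self.field, self.offers, self.current = None, [], {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls in ("hotel-name", "hotel-price"):
            self.field = cls          # remember which field the text fills

    def handle_data(self, data):
        if self.field:
            self.current[self.field] = data.strip()
            self.field = None
            if len(self.current) == 2:   # both fields seen: emit one offer
                self.offers.append((self.current.pop("hotel-name"),
                                    float(self.current.pop("hotel-price"))))

html = ('<div><span class="hotel-name">Hotel Rila</span>'
        '<span class="hotel-price">120.50</span></div>')
parser = OfferParser()
parser.feed(html)                     # parser.offers now holds the pairs
```

The extracted pairs would then feed the processing, editing and classification steps described below; in production the fragility noted under Issues/Exceptions (changes in website structure) is exactly what breaks such selectors.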
The processing of the data consists of primary editing and then classification of the edited data. For this purpose, the national classification of populated places EKATTE (the universal classification of administrative and territorial units) was used, and the data for accommodation were then aggregated up to the level of region (NUTS 3). Furthermore, the classification of the types of accommodation was also applied. Thus the test scraped data on the availability of vacant accommodation are distributed into three large categories: Hotels (hotels, motels, apart hotels, family hotels), Private accommodations (apartments and guest rooms) and Others, n.e.c. (camping sites, hostels and any other accommodations n.e.c.).
The data are analysed with the aim of monitoring the accommodations and their prices over time.
All experimental outputs relate to weekly and monthly periods. The annual price of accommodation for 2 persons has been calculated as a simple mean using the statistical functions of a Python script.
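The classification-and-aggregation step can be sketched as a lookup from scraped location to NUTS 3 region followed by a mean per region and month. The location-to-region mapping below is a tiny hypothetical stand-in for the EKATTE-based classification.

```python
# Sketch of the classification and aggregation step: map each scraped
# location to a NUTS 3 region (hypothetical stand-in for the EKATTE lookup)
# and compute the average price per region and month. Stdlib only.
from collections import defaultdict

EKATTE_TO_NUTS3 = {"Bansko": "Blagoevgrad", "Nesebar": "Burgas"}  # hypothetical

def average_price_by_region(offers):
    """offers: iterable of (location, month, price); returns
    {(nuts3_region, month): mean price}, skipping unmapped locations."""
    sums, counts = defaultdict(float), defaultdict(int)
    for location, month, price in offers:
        region = EKATTE_TO_NUTS3.get(location)
        if region is None:            # location not in the classification
            continue
        sums[(region, month)] += price
        counts[(region, month)] += 1
    return {key: sums[key] / counts[key] for key in sums}

offers = [("Bansko", "2019-10", 80.0), ("Bansko", "2019-10", 120.0),
          ("Nesebar", "2019-10", 95.0), ("Unknown", "2019-10", 50.0)]
avg = average_price_by_region(offers)
```

The same grouping, keyed additionally by accommodation type, yields the hotel/private/other breakdowns used in the tables below.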
Results
All aggregated tables with results for the respective indicators for the period October 2019 – August 2020 are available on the BNSI Intranet under the rubric “Experimental statistics from Big Data sources”, where they are generated automatically from the special database for big data.
This is the first experiment of BNSI in scraping data from tourist websites. This document illustrates, for the objectives of the project, the results of the work during the two-year project period.
On the basis of the data received, one could make flash estimates of the trend in the supply of accommodation (double rooms) and the change in prices over time at national and regional level, as well as by type of accommodation.
These results still have only an experimental character and have to be treated as such, without evaluating their quality.
In the future, experimentation in this direction will continue, with the aim of using these data as an additional, alternative source for the production of official tourism statistics.
The following indicators were produced on the weekly and monthly bases:
Indicator 1: Number and change of accommodations by type and period
Note: due to the large size of the weekly data, they are presented in this document only graphically, for completeness; the monthly data are presented in full.
Table 76. Number and change of accommodations by type and months for Booking.com
Month | Hotels (hotels, motels, apart hotels, family hotels) | Private accommodations (apartments and guest rooms) | Others, n.e.c. | Change: hotels | Change: private | Change: others
2019-10 11 023 11 423 6 665 0 0 0
2019-11 7 781 7 468 4 778 -0.29411 -0.34623 -0.28312
2019-12 8 150 6 742 4 585 0.047423 -0.09721 -0.04039
2020-01 8 012 6 023 3 902 -0.01693 -0.10664 -0.14896
2020-02 10 172 9 373 5 780 0.269596 0.556201 0.481292
2020-03 10 382 12 026 7 802 0.020645 0.283047 0.349827
2020-04 8 806 12 380 7 373 -0.1518 0.029436 -0.05499
2020-05 9 814 12 053 7 255 0.114467 -0.02641 -0.016
2020-06 13 153 11 598 5 547 0.340228 -0.03775 -0.23542
2020-07 15 201 10 953 4 329 0.155706 -0.05561 -0.21958
2020-08 15 692 10 118 4 255 0.032301 -0.07623 -0.01709
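The "Change" columns of Table 76 are month-on-month relative changes, change_t = count_t / count_{t-1} − 1, which can be sketched as:

```python
# Sketch of the month-on-month change computation behind the "Change"
# columns: change_t = count_t / count_{t-1} - 1, with 0 for the first month.
def monthly_changes(counts):
    """counts: list of monthly totals; returns the relative changes."""
    changes = [0.0]                              # no change for the base month
    for prev, cur in zip(counts, counts[1:]):
        changes.append(cur / prev - 1.0)
    return changes

hotels = [11023, 7781, 8150]      # first three hotel counts from Table 76
chg = monthly_changes(hotels)     # reproduces -0.29411 and 0.047423
```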
Figure 111. Number of accommodation establishments by months and types for Booking.com
Figure 112. Changes in the accommodation establishments by months and types for Booking.com
Figure 113. Number of accommodation establishments by weeks and types for Booking.com
Figure 114. Changes in the accommodation establishments by weeks and types for Booking.com
Table 77. Number and change of accommodations by type and months for Pochivka.bg
Month | Hotels (hotels, motels, apart hotels, family hotels) | Private accommodations (apartments and guest rooms) | Others, n.e.c. | Change: hotels | Change: private | Change: others
2019-10 19 821 21 292 20 903 0 0 0
2019-11 15 881 19 119 18 256 -0.19878 -0.102057111 -0.12663
2019-12 19 886 23 611 22 782 0.252188 0.234949527 0.247918
2020-01 15 900 18 730 18 176 -0.20044 -0.206725679 -0.20218
2020-02 15 849 18 621 18 209 -0.00321 -0.005819541 0.001816
2020-03 19 789 23 156 22 731 0.248596 0.243542237 0.248339
2020-04 15 830 18 362 18 110 -0.20006 -0.207030575 -0.20329
2020-05 15 816 18 294 18 034 -0.00088 -0.0037033 -0.0042
2020-06 19 645 21 050 22 355 0.242097 0.150650486 0.239603
2020-07 15 531 18 202 18 180 -0.20942 -0.135296912 -0.18676
2020-08 19 345 23 843 22 900 0.245573 0.309910999 0.259626
Figure 115. Number of accommodation establishments by months and types for Pochivka.bg
Figure 116. Changes in accommodation establishments by months and types for Pochivka.bg
Figure 117. Number of accommodation establishments by weeks and types for Pochivka.bg
Figure 118. Changes in accommodation establishments by weeks and types for Pochivka.bg
Table 76 and Table 77 and their graphs show that from the beginning of the observed period (October 2019) the monthly and weekly data for both websites follow a smooth trend: the share of accommodation of the hotel and private apartment type (double rooms) is the greatest. This trend changes sharply in March 2020, when the COVID-19 lockdown started worldwide. The graphs clearly show that the number of accommodations on offer rises sharply, since there is no demand. After June 2020 a smooth decrease in the accommodations on offer begins, owing to the recovery of consumer demand and the gradual exit from the COVID collapse. An interesting detail is the sharp decrease in the supply of private apartments, which shows that people seek mostly secluded accommodation rather than hotel accommodation. For this reason the number of hotels on offer continues to rise, even during the busiest summer months, July and August.
Indicator 2: Number of accommodations by NUTS 3 and period
The distribution of the offered accommodation by region (NUTS 3) shows a difference between the two sources, Booking.com and Pochivka.bg (see Table 78 and Table 79).
On the first website, the number of accommodations on offer is greatest in the capital city, in the large seaside centres of Burgas and Varna, and in the mountain centres. The other regions of the country lag far behind, due to a lack of demand and interest as well as a lack of concentration of accommodations. This makes the structure of the accommodation at regional level significantly uneven. Unlike Booking.com, the biggest national tourist website, Pochivka.bg, offers a more even distribution of accommodation at regional level, including in smaller regional towns. Most likely this is due to the fact that package tourist services are offered on this website, including accommodation for rural tourism.
Table 78. Number of accommodations by NUTS 3 and months for Booking.com
NUTS 3 2019-10 2019-11 2019-12 2020-01 2020-02 2020-03 2020-04 2020-05 2020-06 2020-07 2020-08
Blagoevgrad 2 160 1 583 1 786 1 882 2 495 2 244 1 318 1 494 1 339 1 946 2 165
Burgas 3 685 2 222 2 158 1 674 2 560 2 894 3 221 4 529 5 835 4 485 4 313
Dobrich 789 452 343 261 467 519 700 995 996 970 850
Gabrovo 313 189 151 116 227 247 268 269 309 323 293
Haskovo 46 18 32 30 48 62 52 37 37 113 174
Kardzhali 176 86 53 25 127 165 160 105 96 120 97
Kyustendil 729 519 614 674 773 743 488 433 492 595 523
Lovech 169 139 125 138 184 213 287 202 169 248 254
Montana 81 35 21 31 67 49 31 23 55 55 39
Pazardzhik 686 361 303 375 497 503 359 301 454 502 370
Pernik 35 2 2 1 7 17 29 18 18 25 19
Pleven 133 84 75 61 92 224 258 103 72 70 87
Plovdiv 3 579 2 246 2 004 2 014 2 695 2 853 2 439 2 361 2 240 2 220 2 171
Razgrad 21 9 17 5 12 12 28 16 0 1 2
Ruse 514 251 308 269 461 493 362 399 417 349 351
Shumen 84 41 41 29 62 62 61 46 32 63 80
Silistra 19 12 13 16 21 16 35 41 0 31 36
Sliven 134 98 97 106 162 159 193 128 100 113 131
Smolyan 569 296 264 263 374 416 398 394 435 488 416
Sofia 276 178 127 109 237 243 264 227 183 336 331
Sofia (capital) 6 626 5 460 5 794 5 115 6 663 7 189 6 124 5 468 5 357 4 686 4 414
Stara Zagora 305 159 159 129 276 314 316 278 249 320 333
Targovishte 2 2 10 3 4 10 4 17 0 9 9
Varna 3 112 2 610 2 324 2 234 2 747 3 130 2 643 2 198 2 515 1 881 1 712
Veliko Tarnovo 1 043 677 594 510 850 979 897 805 720 925 846
Vidin 166 56 49 40 51 91 74 64 53 68 46
Vratsa 112 86 70 43 44 74 108 30 19 41 17
Yambol 57 20 37 10 24 24 86 111 69 53 77
Table 79. Number of accommodations by NUTS 3 and months for Pochivka.bg
NUTS 3 2019-10 2019-11 2019-12 2020-01 2020-02 2020-03 2020-04 2020-05 2020-06 2020-07 2020-08
Blagoevgrad 3 171 2 674 3 281 2 634 2 656 3 475 2 857 2 840 3 311 2 715 3 422
Burgas 23 039 20 172 25 194 20 096 19 956 24 960 19 950 19 733 23 423 19 496 25 066
Dobrich 2 549 2 190 2 722 2 168 2 158 2 539 1 974 2 013 2 529 2 096 2 661
Gabrovo 1 367 1 167 1 475 1 188 1 187 1 480 1 185 1 191 1 493 1 171 1 428
Haskovo 520 438 534 434 442 491 363 379 526 414 529
Kardzhali 538 444 573 458 461 576 421 430 549 457 582
Kyustendil 1 244 1 005 1 273 1 019 1 025 1 275 1 019 1 021 1 279 1 050 1 318
Lovech 1 537 1 343 1 650 1 294 1 294 1 608 1 280 1 272 1 580 1 261 1 595
Montana 463 390 498 398 408 500 402 401 478 396 495
Pazardzhik 1 720 1 459 1 781 1 426 1 413 1 721 1 365 1 365 1 683 1 389 1 777
Pernik 99 90 102 85 89 112 93 91 110 83 95
Pleven 319 260 321 261 259 315 254 252 326 265 335
Plovdiv 2 706 2 277 2 817 2 237 2 247 2 865 2 310 2 311 2 782 2 281 2 862
Razgrad 106 91 112 91 93 91 61 63 117 92 110
Ruse 867 731 915 746 735 869 652 663 853 712 912
Shumen 497 430 552 436 439 523 405 402 538 440 551
Silistra 113 87 104 84 83 108 89 87 109 94 111
Sliven 512 436 544 442 435 543 438 439 532 435 537
Smolyan 2 174 1 840 2 276 1 841 1 837 2 254 1 799 1 803 2 254 1 790 2 265
Sofia 1 505 1 279 1 605 1 297 1 299 1 482 1 142 1 149 1 594 1 250 1 585
Sofia (capital) 2 419 1 913 2 349 1 831 1 820 2 292 1 826 1 826 2 213 1 802 2 251
Stara Zagora 595 536 671 540 544 687 537 540 668 523 673
Targovishte 123 103 137 106 115 138 108 110 131 106 137
Varna 3 358 2 922 3 647 2 850 2 818 3 675 3 006 2 977 3 379 2 867 3 707
Veliko Tarnovo 1 938 1 607 1 963 1 554 1 554 2 003 1 611 1 601 1 910 1 561 1 954
Vidin 273 220 277 214 230 346 315 308 324 270 328
Vratsa 195 167 211 171 175 258 220 218 230 192 217
Yambol 619 546 696 521 526 465 249 283 525 442 605
Indicator 3: Average price of accommodations by NUTS 3, type and period
Note: Due to their large size, the underlying data tables cannot be included in this document.
All results are presented graphically only.
The average monthly and weekly prices of the offered accommodations for the period
October 2019 – August 2020 are nearly equal for the two observed websites. As expected, the average
prices decrease in the period after the COVID-19 lockdown, then rise smoothly during the
summer and recover to their pre-March 2020 level. This indicator could be used to
monitor changes in the average prices of the different accommodation types (for double rooms)
in almost real time.
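The computation behind this indicator is a straightforward grouped average of scraped offer prices. A minimal sketch, assuming illustrative field names ('nuts3', 'type', 'month', 'price') rather than the project's actual data model:

```python
from collections import defaultdict

def average_prices(offers):
    """Average offer price per (NUTS 3 region, accommodation type, month).

    `offers` is a list of dicts with illustrative keys:
    'nuts3', 'type', 'month' (e.g. '2020-03') and 'price'.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [price total, offer count]
    for o in offers:
        key = (o["nuts3"], o["type"], o["month"])
        sums[key][0] += o["price"]
        sums[key][1] += 1
    return {k: total / count for k, (total, count) in sums.items()}

# Illustrative records, not actual scraped data
offers = [
    {"nuts3": "Burgas", "type": "hotel", "month": "2020-03", "price": 80.0},
    {"nuts3": "Burgas", "type": "hotel", "month": "2020-03", "price": 100.0},
    {"nuts3": "Varna", "type": "apartment", "month": "2020-03", "price": 60.0},
]
avg = average_prices(offers)
# avg[("Burgas", "hotel", "2020-03")] -> 90.0
```

The same grouping applied with a week key instead of a month key yields the weekly series.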
Indicator 4: Price index of accommodations by NUTS 3, type and period
Note: Due to their large size, the underlying data tables cannot be included in this document.
All results are presented graphically only.
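The price index can be derived from grouped average prices by fixing a base period. A minimal sketch, assuming a base month of October 2019 (the start of the observed period) and illustrative keys:

```python
def price_index(avg_prices, base_month):
    """Index (base month = 100) of average prices per (region, type, month).

    `avg_prices` maps (nuts3, type, month) to an average price; series
    with no base-month observation are skipped.
    """
    index = {}
    for (nuts3, typ, month), price in avg_prices.items():
        base = avg_prices.get((nuts3, typ, base_month))
        if base:
            index[(nuts3, typ, month)] = 100.0 * price / base
    return index

# Illustrative averages, not actual results
avg_prices = {
    ("Burgas", "hotel", "2019-10"): 80.0,
    ("Burgas", "hotel", "2020-04"): 60.0,
}
idx = price_index(avg_prices, "2019-10")
# idx[("Burgas", "hotel", "2020-04")] -> 75.0 (a 25 % drop after the lockdown)
```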
Indicator 5: Correlation coefficients
Table 80. Correlation coefficient between rating and price for Booking.com
rating price
rating 1.000000 0.062594
price 0.062594 1.000000
Table 80 shows that there is no correlation between the rating and the price of the accommodation
offered on Booking.com, i.e. a high rating does not ensure high prices. A possible reason is the
small number of reviews on which the rating of a given accommodation is based.
Table 81. Correlation coefficient between rating and price for Pochivka.bg
rating price
rating 1.000000 -0.219144
price -0.219144 1.000000
The correlation between the rating and the prices on Pochivka.bg is negative but also insignificant,
i.e. accommodations with low ratings may nevertheless maintain higher prices.
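The coefficients in Tables 80 and 81 are ordinary Pearson correlations between the rating and price columns. A self-contained sketch with illustrative numbers, not the scraped data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative ratings and nightly prices for five offers
ratings = [8.1, 9.0, 7.5, 8.7, 6.9]
prices = [55.0, 60.0, 52.0, 58.0, 70.0]
r = pearson(ratings, prices)  # a value near zero means rating does not drive price
```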
Use Case Identification
Use Case ID WPJ.8.NL
Use Case Name Experimental tourism statistics
Date of creation 05.04.2020 (Last update 02.10.2020)
Design scope (Sub-)system from which most specifications are a “black-box”20.
Use Case Definition
Description: Use the Google Trends (GT) API to find proxies for the number of tourists and the
number of nights spent by tourists.
Preconditions: 1. User has to select the names of platforms to be checked in the API.
2. User has to select the tourist countries of interest.
3. Survey data on guests and nights spent (2015-2020) are available.
Flow:
1. Extract weekly GT indexes for NL, BE, DE, US, UK and the world for the keywords “booking”,
“hotels”, “airbnb”, “trivago”, “tripadvisor” and “bed and breakfast”, using the Google Trends
API.
2. Concatenate weekly data and estimate monthly GT index averages.
3. Compute cumulative indices based on platforms and developments (Year-on-Year and Month-
on-Month).
4. Link GT indices to Survey data.
5. Compute Correlation matrix.
6. Use index developments to estimate number of tourists and number of nights spent by tourists
in the Netherlands.
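Steps 2 and 3 of the flow can be sketched as follows; the weekly index values below are illustrative placeholders, not actual Google Trends output:

```python
from collections import defaultdict

def monthly_averages(weekly):
    """Average weekly GT index values per month.

    `weekly` maps an ISO date string for the week start ('YYYY-MM-DD')
    to the 0-100 GT index for that week.
    """
    buckets = defaultdict(list)
    for day, value in weekly.items():
        buckets[day[:7]].append(value)  # group under the 'YYYY-MM' key
    return {month: sum(v) / len(v) for month, v in buckets.items()}

def cumulative(monthly):
    """Running total of the monthly index, in chronological order."""
    out, total = {}, 0.0
    for month in sorted(monthly):
        total += monthly[month]
        out[month] = total
    return out

weekly = {"2020-03-01": 40, "2020-03-08": 20, "2020-04-05": 10, "2020-04-12": 14}
m = monthly_averages(weekly)   # {'2020-03': 30.0, '2020-04': 12.0}
c = cumulative(m)              # {'2020-03': 30.0, '2020-04': 42.0}
```

The cumulative series `c` is the kind of index that is linked to the survey data in steps 4–6; Year-on-Year and Month-on-Month developments are simple ratios of these values.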
Issues/Exceptions:
1. The GT API provides a (time-dependent) index in the range 1-100.
2. GT indices are available for the last 5 years.
3. The effects of rescaling on the estimates have yet to be researched.
4. Estimates for February and March obtained at the beginning of April 2020 were not accurate
enough (see Figure 127).
5. Estimates for August and September obtained in October 2020 seem reasonable
(see Figure 128).
6. This approach concentrates on hotels, holiday homes, bed and breakfasts and Airbnb
accommodations. Camping parks and other countryside accommodation forms are out of
scope.
7. Notice that searching for an accommodation with Google (the input data of Google Trends) does not
immediately mean that an accommodation is actually booked. Nor does it provide
information on the number of guests connected to the searcher: two persons can be searching
for an accommodation for just one visitor, and one person can be searching (and eventually
booking) an accommodation for one person or more. Moreover, the indices returned by the
GT API are on a scale of 1-100 and are permanently re-scaled21.
20 Three hierarchical categories from Cockburn were used: design scope (sub)system black-box, (sub)system white-box and component [Cockburn, A. (2001), Writing Effective Use Cases, Addison-Wesley. ISBN 0-201-70225-8. OCLC 44046973].
8. The results can be biased because the survey data on guests and nights spent do not include
guests (or nights spent) booked via Airbnb.
9. Only GT indices from NL were used in the exploratory estimations (based on the correlations
found). The estimates for German, British and US American tourists are not shown here as they
require further analysis.
10. Autocorrelation (and tuning based on time lags) has not been taken into account yet.
Output summary:
Continuous research and feasibility studies of innovative data sources are a crucial task for statistical
offices in supporting the quality and timeliness of tourism statistics. The current case study shows the
results obtained by using Google Trends (GT) at three moments in time:
T1) immediately after the announcement of the corona lockdown in the Netherlands: third week
of March 2020;
T2) during the first week of the (first-wave) corona lockdown: first week of April 2020; and
T3) during the first week of the (second-wave) corona period without lockdown but with strict public
measures (i.e. not enforced but strongly advised): last week of September 2020.
The main research question concerns finding proxies for the number of tourists and the number of nights
spent by tourists in the Netherlands. Therefore, the countries of origin of the most frequent visitors to
the Netherlands were used as input data, and the platforms that (potential) visitors may use to
find accommodation in the Netherlands were used as keywords.
The results shown here are based on the case of guests and nights spent in hotels in the city of
Amsterdam. The correlation matrices obtained at time T1 are shown in Figure 127. The highest
correlation values are for “booking Amsterdam” and “all countries” (= 0.72) and for “booking
Amsterdam” and “European excl. Dutch”. Surprising are the negative correlation values between “hotels
Amsterdam” and almost all countries of origin.
21 Stephens-Davidowitz, S. and Varian, H. (2014), A Hands-on Guide to Google Data, Section 4, p. 9-22, Google, Inc., September 3, 2014 (Revised March 7, 2015).
Figure 127. Correlation matrices at time T1
A cumulative GT index is proposed as a proxy for the number of guests spending the night in hotels in
Amsterdam. Figure 128 shows three curves: the (cumulative) GT index, the survey estimates available at
the moment of the corona lockdown, and the number of guests estimated using the development of the
cumulative GT index. A strong correlation (0.80) was observed.
Figure 128. Estimates of the total number of guests in hotels in Amsterdam at time T1
An update of the estimation was carried out at time T3. Figure 129 shows the corresponding
correlations and Figure 130 the estimates. Notice that, based on a suggestion by Statistics Poland and
a comment by Statistics Italy, a ranking page for travel and tourism22 was used to check which
portals are the most relevant for the Netherlands. This page shows that Tripadvisor is more relevant
than Trivago. This information was used in the choice of keywords for the update of the results on
Google Trends.
22 https://www.similarweb.com/top-websites/category/travel-and-tourism/
Figure 130 again shows three curves: the (cumulative) GT index, the survey estimates available at the
moment of the (second-wave) corona period, and the number of guests estimated using the development
of the cumulative GT index. A strong correlation (= 0.93) is observed.
Figure 129. Correlation matrices at time T3
Figure 130. Estimates of the total number of guests in hotels in Amsterdam at time T3
In conclusion, the cumulative GT index seems to be a good proxy for the number of guests.
Furthermore, based on the correlations and rapid estimates, GT indices may be considered
a potential input for statistics. However, the method is not yet robust enough for rapid estimates and
other unforeseen events, such as the corona pandemic. Notice, though, that almost all methods might
fail in such extreme conditions.
5. Conclusions
This chapter outlines the complex dynamics of tourism in Europe and the need for an up-to-date system
for the integration and monitoring of information about it. It also provides a set of findings obtained
through case studies as a result of the work carried out by the ESSnet Big Data II (Pilot Track) Work
Package J on Innovative Tourism Statistics. Furthermore, it points out the benefits of using micro-services
implemented in the Tourism Integration and Monitoring System (TIMS), which were modularly tested
across the eight European WPJ partners. Finally, this chapter summarizes the enriched and deepened
evidence base on big data methodology and implemented processes as cornerstones of the prototype
TIMS system and its potential.
Tourism statistics in context
Changes in the behaviour of tourists, the broad adoption of new technologies in the tourism economy,
updates to EU legislation (GDPR), as well as the rise of unforeseen large-scale events (such as the
COVID-19 pandemic), force national statistical offices to constantly identify new sources of
information, improve statistical approaches to obtaining data, and innovate in linking both to their
production processes and statistical output. Meeting these challenges is extremely complex. This is
due to frequent limitations in access to the resources of data administrators, differences in the
methodology used during data collection, processing and compilation, as well as difficulties in
selecting methods to combine data from external sources with data held by official statistics.
The WPJ was implemented in cooperation with employees from eight European statistical offices, all
with diverse backgrounds and, more importantly, representing their own countries' tourism realities
and potential, but with the same goal: to respond to the dynamic challenges faced by the tourism
economy. The involvement of people with many years of experience in the fields of tourism and IT allowed
the development of technical and methodological solutions that can be used in the future to improve
the quality of tourism statistics.
Sources of information
The research shows that in all participating European countries, there are additional sources that can
be used to improve the quality of survey results on both the demand and supply sides of tourism
(see WPJ product Catalogue of Sources). Although at the moment, not all of the 130 sources identified
in the project have open access to data, there are chances to use their potential in the future, based
on appropriate arrangements with individual administrators.
During the project, innovative sources from smart systems were also identified. These can be an
important source of information for tourism statistics and are in the possession of many European
cities. They include data from Automatic Number Plate Recognition Systems (ANPRS), which enhance
the monitoring of vehicle movements or the use of parking spaces. Data provided by Smart City systems
can be used to estimate tourist traffic at the local level, including traffic related to the organization
of mass events such as concerts and sports events. Moreover, data sources such as anonymized data from
mobile networks allow the measurement of aggregated tourism flows (e.g. day visits, event tourism) in
addition to overnight tourism.
Due to the frequency of data updates, it is particularly important to recognize and use big data
collections obtained from websites. During the implementation of the project, it was shown that global
and local portals containing information on accommodation, food and transport prices offer great
potential for improving the quality of statistical data in the field of tourism. They can provide
data on the number of tourist establishments, trips of tourists and their expenses. Based on the analysis
of the scraped data, it was concluded that the offers posted on international portals contain the largest
amount of relevant information about collective accommodation establishments, such as price per
night, address details, geolocation data, number of stars, customer reviews, etc. This information can
be used for creating official statistics for both the demand side and the supply side of tourism. Notice
that web scraping supports the identification and categorization of new accommodation types,
e.g. the rise of the accommodation type “chalets” in the Netherlands, which reflects the Dutch policy
efforts to flatten the curve of the spreading coronavirus pandemic.
Furthermore, notice also that in three countries participating in the project (Bulgaria, Poland, Slovakia),
it was shown that local portals are richer than global ones in offers of small accommodation
establishments, e.g. agritourism farms, which often do not advertise on international portals. It should
be emphasized that this is of particular importance in the time of the COVID-19 pandemic, due
to a change in the behaviour of tourists who, because of the restrictions on international traffic
and the need to maintain social distance, began to use small accommodation establishments more often.
Therefore, as part of the project, web scraping was developed for both global and local portals.
It must be stressed that the recent draft agreement between Eurostat and the international platforms
concerning the exchange of data and metadata on short-stay accommodation for compiling official
statistics will, of course, facilitate countries' access to these data. Nevertheless, downloading
data from websites is still necessary to enrich the information provided by official statistics by
collecting other variables on tourist accommodation establishments (e.g. their facilities, their services
for persons with reduced mobility, their geographic coordinates, etc.). This is useful for hotels
but, above all, for the other collective accommodation types, in particular “holiday dwellings” and
“private accommodations”, in order to compile statistics on these establishments.
Due to the international nature of the project and the large diversity of individual portals in terms of
their structure, the WPJ developed a common method for retrieving data from websites. For this
purpose, using technologies and open source libraries, a tool (Visual Modeler) was created to facilitate
the extraction of data from unstructured sources. The versatility of the tool was confirmed by all
project partners. Thanks to the built-in functionalities, it was possible to download data from web
portals in a consistent manner in all countries. Visual Modeler, in comparison to other methods of data
collection, is distinguished primarily by its user interface, thanks to which no specialized programming
knowledge is required to initiate the web scraping process23.
Furthermore, an open source visualization tool (visNetwork) has been developed to provide a dynamic
and rapid overview of the inputs and outputs of the WPJ pilot project, i.e. showing the relationships
between the data sources and the expected results across the eight participating countries (Flow Models).
23 A short video about Visual Modeler is available on https://youtu.be/cYETq0rIT9k.
Statistical approaches
During the course of the work, all project partners tested several methods for combining web-scraped
data with data from the tourist accommodation survey. For this purpose, the address data was
standardized with the use of a geolocation micro-service (using the HERE Maps API). On the basis of
the obtained results, it was shown that combining records on the basis of address data alone is prone
to errors resulting from different ways of recording the data. Although combining data on the basis of
geographic coordinates produced results in a shorter time, the precision of those results was still
unsatisfactory. A hybrid solution prepared by the project group turned out to be the most effective
way to combine the data. It consisted in combining coordinate-based linkage, using a specified distance
radius, with address data. Thanks to the hybrid method, results of high precision were obtained.
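A minimal sketch of the hybrid linkage idea, using the haversine formula as a simpler stand-in for Vincenty's formulae; the field names and the 100 m radius are illustrative assumptions, not the project's parameters:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (haversine; Vincenty is more precise)."""
    r = 6371000.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def hybrid_match(scraped, survey, radius_m=100.0):
    """Link a scraped record to a survey record: coordinates first, address second.

    Candidates within `radius_m` are kept; among them, the record whose
    normalised address shares the most tokens with the scraped address wins.
    """
    def tokens(addr):
        return set(addr.lower().replace(",", " ").split())

    candidates = [
        s for s in survey
        if haversine_m(scraped["lat"], scraped["lon"], s["lat"], s["lon"]) <= radius_m
    ]
    if not candidates:
        return None
    return max(candidates,
               key=lambda s: len(tokens(s["address"]) & tokens(scraped["address"])))

# Illustrative records
scraped = {"lat": 42.6977, "lon": 23.3219, "address": "12 Vitosha Blvd, Sofia"}
survey = [
    {"id": 1, "lat": 42.6979, "lon": 23.3221, "address": "Vitosha Blvd 12, Sofia"},
    {"id": 2, "lat": 42.6500, "lon": 23.3800, "address": "1 Other St, Sofia"},
]
best = hybrid_match(scraped, survey)  # the first record, despite the reordered address
```

The address comparison makes the method robust to different recording conventions (street number before or after the street name), which is exactly where address-only matching fails.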
As part of the project implementation, many methods for linking, processing and developing tourism
data were used (including distance-based data linkage with Vincenty's formulae, the least absolute
shrinkage and selection operator (LASSO), ridge regression, spatial-temporal disaggregation with a
regularization term, linear models, ARIMA, the James-Stein estimator, regression trees, support vector
machines, random forests, optimal weighted nearest-neighbour classifiers, predictive mean matching,
Bayesian linear regression, and the random indicator for nonignorable data), which made it possible to:
- estimate the number of residents' trips to countries not shown in the survey of trips; the
results were obtained by combining administrative data, statistical survey data and big data
in the field of air traffic;
- estimate the size of tourists' expenses using machine learning methods on flight
schedules and the prices of accommodation, air tickets and meals; this ensured greater stability
of the results for individual countries and quarters than estimation on the basis
of the sample survey alone;
- disaggregate monthly data on the occupancy of accommodation establishments into daily
data, maintaining weekly seasonality and calendar effects;
- develop flash estimates of the number of people accommodated in tourist accommodation
establishments in the T + 1 period, with significant use of data obtained through web scraping
from local and global accommodation portals; the results were highly competitive
with those obtained from the monthly survey of the occupancy of accommodation
establishments, although the COVID-19 pandemic forced the project team to revise the models
to make them resistant to extreme situations;
- improve the quality of estimates for the main Tourism Satellite Account (TSA) aggregates,
including, in particular, internal tourism consumption; the pilot application of the solutions
developed during the project allowed new estimates of the total value.
However, the applied methods require further development towards determining the
size of phenomena that are currently difficult to estimate and are of great importance
for the development of the TSA. This will be particularly important for estimating the value of
second-home services, tourist packages, car rental by tourists, and sales of conference services.
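Of the listed methods, ridge regression illustrates the shared idea of regularization most compactly: a penalty shrinks the estimate towards zero, stabilising it when observations are few. A deliberately one-dimensional sketch (centred data, no intercept) with synthetic numbers, not the project's expenditure model:

```python
def ridge_1d(xs, ys, lam=1.0):
    """One-predictor ridge estimate: beta = sum(x*y) / (sum(x^2) + lam).

    With lam = 0 this is the ordinary least squares slope; a positive
    penalty shrinks the slope towards zero, which stabilises estimates
    for small per-country, per-quarter samples.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Synthetic centred predictor (e.g. a price variable) and response (expenses)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
beta_ols = ridge_1d(xs, ys, lam=0.0)    # ordinary least squares slope
beta_ridge = ridge_1d(xs, ys, lam=5.0)  # same data, shrunk towards zero
```

In the multivariate case the closed form generalises to β = (XᵀX + λI)⁻¹Xᵀy, and LASSO replaces the squared penalty with an absolute one, which additionally drops irrelevant predictors.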
Link to production process and statistical output
In many countries, tourism is treated as a priority sector due to its role and the benefits it brings to the
economy. Possessing reliable and up-to-date information on tourist traffic, the length of tourists' stays
and the level of their expenses opens up new possibilities for conducting an effective tourism policy.
The solutions developed during the project implementation allow for the provision of more detailed
and up-to-date data on tourism at the European level, without the need for additional research.
However, due to the complex nature of the phenomenon of tourism, the innovative methods of
collecting and combining data used in the project require further improvement and adaptation to the
capabilities and resources of individual countries. Additionally, when developing solutions dedicated
to tourism statistics, the impact of unforeseen phenomena such as the current COVID-19 pandemic on
the tourism industry should be taken into account. Nevertheless, the solutions presented in this report
are an important step towards creating a coherent information system useful for authorities,
representatives of the tourism industry and tourists themselves.
References
Abdulkadri A., Evans A., Ash T., (2016). An assessment of big data for official statistics in the Caribbean:
Challenges and opportunities, Studies and Perspectives Series – The Caribbean No. 48, 58
Altin L., Tiru M., Saluveer E., Puura A., (2015). Using Passive Mobile Positioning Data in Tourism and
Population Statistics, NTTS 2015 Conference abstract
Andreou E., Ghysels E., Kourtellos A., (2012). Forecasting with mixed-frequency data
Box, G. E. P.; Tiao, G. C., (1973). Bayesian Inference in Statistical Analysis. Wiley
Braaksma B., Zeelenberg K., (2020). Big data in official statistics, CBS Discussion Paper
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J., (1984). Classification and Regression Trees,
Wadsworth International Group, Belmont
Brownlee, D. J., (2017). Machine Learning Mastery With R. Melbourne: Jason Brownlee
Cai D., Yu S., Wen J.-R., Ma W.-Y., (2003). Extracting content structure for web pages based on visual
representation
Chow, G.C., and A.L. Lin., (1971). Best Linear Unbiased Interpolation, Distribution, and Extrapolation
of Time Series by Related Series. Review of Economics and Statistics no. 53 (4):372–375.
doi: 10.2307/1928739
Christen, P., Goiser, K., (2007). Quality and Complexity Measures for Data Linkage and Deduplication
[in:] Guillet, F. J., Hamilton, H. J. (eds.) Quality Measures in Data Mining, SCI, Springer, Heidelberg,
Vol. 43, pp. 127–151
Cockburn, A., (2001), Writing effective use cases. Addison-Wesley. ISBN 0-201-70225-8. OCLC
44046973
Cristianini, N., Shawe-Taylor, J., (2000). An Introduction to Support Vector Machines and other kernel-
based learning methods, Cambridge University Press
Dagum, E.B., Cholette P.A., (2006). Benchmarking, Temporal Distribution, and Reconciliation Methods
for Time Series, Lecture Notes in Statistics. New York: Springer
Daas P., Ossen S., Vis-Visschers R., Arends-Tóth J., (2019). Checklist for the Quality evaluation of
Administrative Data Sources, CBS Discussion paper 09042
Daas P.J.H., Puts M., Tennekes M., Priem A., (2014). Big Data as a Data Source for Official Statistics:
experiences at Statistics Netherlands. Proceedings of Statistics Canada International Methodology
Symposium 2014, Gatineau, Canada
Daas P.J.H., Puts M.J.H., Buelens B., van den Hurk P.A.M., (2015). Big data as a source for official
statistics. 31, 249–269
Demunter C., (2017). European Commission, DG EUROSTAT STATISTICAL WORKING PAPERS, Tourism
statistics: Early adopters of big data 2017 edition, Eurostat, STATISTICAL WORKING PAPERS
Demunter C., Seynaeve G., (2017). Better quality of mobile phone data based statistics through the
use of signalling information – the case of tourism statistics, NTTS Conference
Denton, F.T., (1971). Adjustment of Monthly or Quarterly Series to Annual Totals — Approach Based
on Quadratic Minimization. Journal of the American Statistical Association no. 66 (333):99–102
Di Fonzo, T., (2003). Temporal Disaggregation of a System of Time Series when the Aggregate is Known:
Optimal vs. Adjustment Methods. In Paris–Bercy 05/12/1994 – 06/12/1994. Workshop on Quarterly
National Accounts, edited by R. Barcellan and G.L. Mazzi, 63–77. Luxembourg: Office for Official
Publications of the European Communities
Ding, Y. Fienberg S.E., (1994). Dual system estimation of Census undercount in the presence of
matching error, Survey Methodology, 20, 149-158
European Commission, (2014). Feasibility Study on the Use of Mobile Positioning Data for Tourism
Statistics, Eurostat, Consolidated report
European Commission, Consumer market study on online market segmentation through personalised
pricing/offer in the European Union, (2018) - Request for Specific Services 2016 85 02 for the
implementation of Framework Contract EAHC/2013/CP/04, Final report, Ipsos – London Economics –
Deloitte consortium
Fernandez, R.B., (1981). A Methodological Note on the Estimation of Time-Series. Review of Economics
and Statistics no. 63 (3):471–476. doi: 10.2307/1924371
Fourdrinier D., Wells M. T., (2012). On Improved Loss Estimation for Shrinkage Estimators. Statistical
Science, 27, 1, 61-81
Friedl, J. E. F., (2006). Mastering Regular Expressions, O'Reilly Media, Third edition
Glez-Pena, D, Lourenço A, López-Fernández H, Reboiro-Jato M, Fdez-Riverola F., (2013). Web scraping
technologies in an API world. Briefings in Bioinformatics, Vol.15, Issue 5: 788–797
Green, E. J, Strawderman W. E., (1991). A James-Stein Type Estimator for Combining Unbiased and
Possibly Biased Estimators. Journal of the American Statistical Association, 86, 416, 1001-1006
Gretzel, U., Zhong, L., Koo, C. (2016): Application of smart tourism to cities. International Journal of
Tourism Cities, 2(2). https://doi.org/10.1108/IJTC-04-2016-0007
Hedden, H. (2016). The Accidental Taxonomist, 2nd edition
Heerschap, N. Ortega S., Priem A. Offermans M., (2014). Innovation of tourism statistics through the
use of new big data sources. In 12th Global Forum on Tourism Statistics, Prague, CZ
Heerschap, N., (2017). Ontwikkeling van een datasysteem voor toerisme, CBS-NRIT Symposium
Trendrapport Toerisme 2017, November 2017, The Hague, NL
Heerschap N., (2014). Mobile phone data and other new sources for tourism statistics (in Dutch)
Section 10.2, Statistics Netherlands book on Tourism, 158-168, The Hague, The Netherlands
Heerschap N.M., Ortega Azurduy S.A.,. Priem A.H., Offermans M.P.W., (2014). Innovation of tourism
statistics through the use of new Big Data sources, paper presented at the Global Forum on Tourism
Statistics, Prague
Heerschap N.M., Ortega Azurduy S., (2020). Tourism statistics and the use of social media data,
Discussion Paper
Hellwig, Z., (1968). On the optimal choice of predictions, UNESCO, Paris
Hellwig, Z., (1972). Approximative Methods of Selection of an Optimal Set of Predictors, [in]: Study XVI
of the UNESCO Statistical Office, Towards a System of Quantitative Indicators of Components of
Human Resources Indicators Development, UNESCO, Paris
Ho, T. K., (1998). The Random Subspace Method for Constructing Decision Forests, IEEE Transactions
on Pattern Analysis and Machine Intelligence, 20 (8), pp. 832–844
Hoekstra, R., ten Bosch, O., Harteveld, F., (2010). Automated Data Collection from Web Sources for
Official Statistics: First Experiences, Project report, June 2010, The Hague
Hoerl A. E.; Kennard R. W., (1970). Ridge regression: Biased estimation for nonorthogonal problems.
Technometrics. 12 (1): 55–67. doi:10.1080/00401706.1970.10488634
Jaccard P., (1901) Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions
voisines. Bulletin de la Société Vaudoise des Sciences Naturelles 37, 241-272
Jolani, S., (2012). Dual Imputation Strategies for Analyzing Incomplete Data, Dissertation. University
of Utrecht
Kennedy A. B. W.; Sankey H. R., (1898). The Thermal Efficiency of Steam Engines. Minutes of the
Proceedings of the Institution of Civil Engineers. 134 (1898): 278–312
Kohavi R., (1995). A study of cross-validation and bootstrap for accuracy estimation and model
selection. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence. San
Mateo, CA: Morgan Kaufmann. 2 (12): 1137–1143
KOMUSO team, (2019).“Quality Guidelines for Multisource Statistics - QGMSS”, ESSnet project
Kruskal, W. H., Wallis W. A., (1952). Use of Ranks in One-Criterion Variance Analysis. Journal of the
American Statistical Association 47 (260), 583-621
Kwok L., Yu B., (2015). Taxonomy of Facebook messages in business-to-consumer communications:
What really works?, Tourism and Hospitality Research
Laloli H., (2015). Taxonomy Tools, internal CBS report
Laloli H., (2015). Taxonomy for the CBS, internal CBS presentation
Levenshtein V. I., (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet
Physics Doklady. 10 (8): 707–710
Litterman, R.B., (1983). A Random Walk, Markov Model for the Distribution of Time Series. Journal of
Business & Economic Statistics no. 1 (2):169–173
Maravall, A., (2011). Seasonality Tests and Automatic Model Identification in TRAMO-SEATS. Bank of
Spain
Maślankowski M., Salgado D., Quaresma S., Ascari G., Brancato G., Di Consiglio L., Righi P., Tuoto
T., Daas P., Six M., Kowarik A., (2020). Quality report template, ESSnet on Big Data II
Maślankowski M., Salgado D., Quaresma S., Ascari G., Brancato G., Di Consiglio L., Righi P., Tuoto
T., Daas P., Six M., Kowarik A., (2020). Revised Version of the Quality Guidelines for the Acquisition
and Usage of Big Data - Final version, ESSnet on Big Data II
McGill R., Tukey J. W., Larsen, W. A., (1978). Variations of box plots. The American Statistician, 32, 12–
16. doi: 10.2307/2683468
McKercher B., (2016). Towards a taxonomy of tourism products, Tourism Management, Volume 54,
June 2016, pp. 196-208
Piela P., (2018). Non-traditional data sources in Social Statistics of Statistics Finland, 17th Meeting of
ECLAC, Santiago de Chile
Powers D. M. W., (2011). Evaluation: From Precision, Recall and F-Measure to ROC, Informedness,
Markedness & Correlation. Journal of Machine Learning Technologies. 2 (1): 37–63
Rubin, D. B., (1986). Statistical Matching Using File Concatenation with Adjusted Weights and Multiple
Imputations, Journal of Business & Economic Statistics, 4 (1), pp. 87–94
Salgado D., Oancea B., (2020). On new data sources for the production of official statistics
Samworth R.J., (2012) Optimal weighted nearest neighbour classifiers. Annals of Statistics, 40, 2733-
2763
Sidorov G., Gelbukh A., Gómez-Adorno H., Pinto D., (2014). Soft Similarity and Soft Cosine Measure:
Similarity of Features in Vector Space Model. Computación y Sistemas. 18 (3): 491–504.
doi:10.13053/CyS-18-3-2043
Spinder, S., (2019). Estimation of the number of guests and overnight stays in platform-related
accommodations, Discussion paper, October 2019, The Hague
Stateva G., ten Bosch O., Maślankowski J., Barcaroli G., Scannapieco M., Summa D., Greenaway M.,
Jansson I., Wu D., (2018). Methodological and IT Issues and Solutions. ESSnet Big Data I
Stateva G., ten Bosch O., Maślankowski J., Righi A., Scannapieco M., Greenaway M., Swier N.,
Jansson I., Wu D., (2016) Legal aspects related to Webscraping of Enterprise Web Sites. ESSnet Big
Data I
Stehman, S. V., (1997). Selecting and interpreting measures of thematic classification accuracy.
Remote Sensing of Environment. 62 (1): 77–89
Stephens-Davidowitz, S. and Varian, H. (2014), A Hands-on Guide to Google Data, Section 4, pp. 9-22,
Google, Inc., September 3, 2014 (Revised March 7, 2015)
Ten Bosch, O., Windmeijer, D., van Delden A., Van den Heuvel, G., (2018). Web scraping meets survey
design: combining forces, October 2018, Barcelona
Tibshirani R., (1997). The LASSO Method for Variable Selection in the Cox Model. Statistics in Medicine.
16 (4): 385–395. CiteSeerX 10.1.1.411.8024. doi:10.1002/(SICI)1097-0258(19970228)16:4<385::AID-SIM380>3.0.CO;2-3
UNECE, (2014). A Suggested Framework for the Quality of Big Data, Deliverables of the UNECE Big Data
Quality Task Team, December, 2014
Van Delden, A., Scholtus, S., De Waal, T., (2019). Methods for Measuring the Quality of Multisource
Statistics, In Conference: Use of administrative data in social statistics, June 2019, Valencia
United Nations, Commission of the European Communities–Eurostat, World Tourism Organization,
and Organisation for Economic Co-operation and Development, (2010). Tourism Satellite Account:
Recommended Methodological Framework 2008. Luxembourg, Madrid, New York, Paris. United
Nations publication, Sales No. E08. XVII.27. (herein referred to as “TSA: RMF 2008”)
Wang, X., Li, X.R., Zhen, F., Zhang, J.H., (2016). How smart is your tourist attraction?: Measuring tourist
preferences of smart tourism attractions via a FCEM-AHP and IPA approach. Tourism Management,
54: 309–320. https://doi.org/10.1016/j.tourman.2015.12.003
van der Valk J., Mitriaieva A., Seifert W., Strauch K., Skovbo M., Schnor O., Cierpial-Wolan M.,
Jasiukiewicz D., Kuzma I., Balmand S., Franconi L. (2016). Border Region Data collection, Project
n° 2016CE16BAT105, Final report
Weaver W., (1955). Translation. Machine translation of languages, 14:15-23
Welch, B. L., (1951). On the Comparison of Several Mean Values: An Alternative Approach. Biometrika
38 (3/4), 330-336
Winkler W. E., (1990). String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter
Model of Record Linkage. Proceedings of the Section on Survey Research Methods. American Statistical
Association: 354–359
Zorpas, A., Voukkali, I., Loizia, P., (2014). The impact of tourist sector in the waste management plans.
Desalination and Water Treatment. 56. 1–9. http://dx.doi.org/10.1080/19443994.2014.934721
Zult D., de Wolf P.P., Bakker, B., van der Heijden P.G.M., (2019). A General framework for multiple-
recapture estimation that incorporates linkage error correction, May 2019, The Hague
WP8 (2018), Deliverable 8.2: Report describing the quality aspects of Big Data for Official Statistics,
ESSnet Big Data I, version 7 May 2018, pp. 6-7. (based on Suggested Framework for the Quality of Big
Data, Deliverables of the UNECE Big Data, Quality Task Team)
WPF (2019a), Deliverable F1: BREAL- Big Data REference Architecture and Layers Business Layer,
ESSnet project, December 2019
WPF (2019b), Process and Architecture, ESSnet Big Data II, Vienna Meeting, December 2019
WPJ (2020a), Deliverable J1: ESSnet Methods for web scraping, data processing and analyses,
23 July 2019
WPJ (2020b), Deliverable J2: Interim technical report showing the preliminary results and a general
description of the methods used, 7 January 2020
WPJ (2020c), Deliverable J3: Methodological framework report, 13 March 2020
WPJ (2020d), Deliverable J4: Technical Report, 16 June 2020
WPE (2020), Deliverable E3: Interim technical report concerning the conditions for using the data, the
methodology and the procedures (AIS data: Example based on BREAL, 2019)
ESSnet Big Data II (2019), Vienna Track Meeting, 11-12 December 2019
WP2 (2011) Deliverable 2.2. Use of Administrative and Accounts Data for Business Statistics, ESSnet
project, March 2013
Annexes
Annex 1 – Quality indicators of data matching
Several indicators can be used to assess the quality of data pairing. Many of them are based on the
confusion matrix.
Table 1. Confusion matrix for data matching
linkage result \ true status | Match (positive) | Non-match (negative) | Sum (linkage)
Link (positive) | True Positive (TP) | False Positive (FP) | Positive (all links)
Non-link (negative) | False Negative (FN) | True Negative (TN) | Negative (all non-links)
Sum (true status) | Positive (all matches) | Negative (all non-matches) | All pairs (best match only)
Taking into account the frequency of each of the four possible pairing results, i.e. True
Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN), a number of measures of
the quality of the pairing of establishments can be constructed (Stehman, 1997; Powers, 2011).
These include:
sensitivity (recall) or True Positive Rate (TPR)
TPR = TP / (TP + FN)
specificity (SPC) or True Negative Rate (TNR)
TNR = TN / (FP + TN)
precision (or positive predictive value, PPV)
Precision = TP / (TP + FP)
accuracy (ACC)
ACC = (TP + TN) / all pairs
false discovery rate (FDR)
FDR = FP / (TP + FP)
false omission rate (FOR)
FOR = FN / (FN + TN)
negative predictive value (NPV)
NPV = TN / (FN + TN)
False Positive Rate (FPR)
FPR = FP / (FP + TN)
False Negative Rate (FNR)
FNR = FN / (TP + FN)
F1 score
F1 = 2 · precision · recall / (precision + recall)
Youden index
J = TPR + TNR − 1
positive likelihood ratio (LR+)
LR+ = TPR / FPR
negative likelihood ratio (LR−)
LR− = FNR / TNR
diagnostic odds ratio (DOR)
DOR = LR+ / LR−
TPR, TNR, PPV, ACC, NPV, F1, the Youden index, LR+ and DOR are "the higher, the better" indicators:
the higher the value, the better the matching. The remaining indicators work the other way round:
the higher the value, the worse the matching.
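As a minimal illustration (with hypothetical counts, not project data), all of the indicators above can be computed directly from the four confusion-matrix cells:

```python
def matching_quality(tp, fp, fn, tn):
    """Quality indicators of data matching from confusion-matrix counts."""
    tpr = tp / (tp + fn)                    # sensitivity / recall
    tnr = tn / (fp + tn)                    # specificity
    precision = tp / (tp + fp)              # positive predictive value
    acc = (tp + tn) / (tp + fp + fn + tn)   # accuracy
    fdr = fp / (tp + fp)                    # false discovery rate
    fomr = fn / (fn + tn)                   # false omission rate
    npv = tn / (fn + tn)                    # negative predictive value
    fpr = fp / (fp + tn)                    # false positive rate
    fnr = fn / (tp + fn)                    # false negative rate
    f1 = 2 * precision * tpr / (precision + tpr)
    youden = tpr + tnr - 1                  # Youden index J
    lr_pos, lr_neg = tpr / fpr, fnr / tnr   # likelihood ratios
    dor = lr_pos / lr_neg                   # diagnostic odds ratio
    return {"TPR": tpr, "TNR": tnr, "PPV": precision, "ACC": acc,
            "FDR": fdr, "FOR": fomr, "NPV": npv, "FPR": fpr, "FNR": fnr,
            "F1": f1, "J": youden, "LR+": lr_pos, "LR-": lr_neg, "DOR": dor}

# hypothetical linkage: 80 true links, 20 false links, 10 missed matches,
# 90 correctly rejected non-matches
q = matching_quality(tp=80, fp=20, fn=10, tn=90)
```

Reading several indicators side by side from such a dictionary makes the trade-off discussed below visible: raising TPR by accepting more links usually raises FPR and FDR at the same time.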
However, there is no single best measure to use. Typically, more than one measure has to be
considered, as positive and negative results are connected: often, the higher the number of True
Positives, the higher the number of False Positives. As an illustrative case, the number of True Positives
is highest if all links are considered positive. Additionally, the costs of False Negatives and False
Positives can differ depending on the aim or application of a data linkage. When enriching the survey
population frame, False Positive linkage results may be less costly than False Negatives, since False
Positives may attract attention at a later stage of the survey anyway. For flash estimates
(i.e. combining data sources at a micro level with additional characteristics to enable additional
analyses), False Positives may weigh more heavily than False Negatives: they often go unnoticed
during analyses and may introduce bias into the results.
Annex 2 – General approach for data disaggregation
In this Annex, matrix notation is used to present the idea of temporal disaggregation. Assume Y (l × 1) is
a low-frequency data vector and X (h × k) is a matrix of k high-frequency auxiliary variables. Let A (l × h) be an
aggregation matrix such that AX is a low-frequency data matrix. The objective is to estimate the high-frequency
data vector y (h × 1) satisfying Ay = Y.
The first step of temporal disaggregation is regression. Using the auxiliary variables, a regression model is
built and preliminary estimates p of the variable y are prepared. In the simplest approach, one
variable is selected and the model is estimated by ordinary least squares; alternatively, several
variables are selected and the model is estimated by generalized least squares.
Typically, the aggregated values of p do not add up to Y. Hence, benchmarking is the second step of temporal
disaggregation. The discrepancy between Y and Ap, defined as
u = Y − Ap,
is distributed over the high-frequency periods as
y = p + Du = p + e,
where D is the distribution matrix.
Let U be the variance-covariance matrix of u. The distribution matrix for additive or multiplicative
benchmarking (Denton F.T. (1971), Di Fonzo T. (2003)) is, respectively, of the form
D = A′(AA′)⁻¹U⁻¹
and
D = Diag(X) A′(AA′)⁻¹U⁻¹ Diag(X).
The random component e for high-frequency data can also be modelled as a first-order autoregressive
process (Chow G.C., Lin A.L. (1971), Fernandez R.B. (1981)) or a random walk process (Litterman R.B.
(1983)).
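The two steps can be sketched numerically as follows. This is a minimal illustration with made-up quarterly data, not the project implementation; for simplicity U is taken as the identity matrix, so the additive distribution matrix reduces to D = A′(AA′)⁻¹ and each quarterly discrepancy is spread evenly over its months:

```python
import numpy as np

# hypothetical data: a quarterly series Y is disaggregated to months using
# one high-frequency indicator x
l, s = 4, 3                                  # 4 quarters, 3 months each
A = np.kron(np.eye(l), np.ones((1, s)))      # aggregation matrix, l x h

x = np.arange(1.0, l * s + 1)                # monthly indicator (h values)
Y = np.array([9.0, 30.0, 45.0, 70.0])        # observed quarterly totals

# Step 1: regression of Y on the aggregated indicator (OLS, one regressor)
beta, *_ = np.linalg.lstsq((A @ x).reshape(-1, 1), Y, rcond=None)
p = beta[0] * x                              # preliminary monthly estimates

# Step 2: benchmarking so that the monthly series adds up exactly to Y
u = Y - A @ p                                # low-frequency discrepancies
D = A.T @ np.linalg.inv(A @ A.T)             # distribution matrix with U = I
y = p + D @ u                                # final estimates satisfy A y = Y
```

With a non-identity U, D would weight the distribution of each discrepancy by the assumed covariance structure instead of spreading it evenly.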
If the results of temporal disaggregation obtained by the above method are unacceptable, regression
methods with a regularization parameter such as LASSO or ridge regression should be used.
A very useful tool for visualizing the LASSO variable-selection procedure is the LASSO path graph.
Figure 1. Sample LASSO path
In the first step, the model adds variable 1, the best variable in terms of mean square error.
In the second step, the model adds variable 2, and so on. In the last step, all variables are included
and the parameter estimates coincide with those of the least squares method. The sign of a parameter may
change as new variables are added, because the allowable length of the parameter vector
increases and additional correlations are taken into account. In this example, the parameter of
variable 1 changes sign in the fourth step. As a result, LASSO produces several sets of forecasts,
which allows one to choose an acceptable solution.
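The behaviour of the LASSO path can be reproduced with a small coordinate-descent implementation on simulated data (a sketch under simplifying assumptions — standardized regressors and made-up coefficients — not the tool used in the project): as the penalty λ decreases, variables enter the model, and at λ = 0 the solution coincides with least squares:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=1000):
    """Plain coordinate-descent LASSO; assumes standardized columns of X."""
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(n_iter):
        for j in range(k):
            # partial residual excluding variable j
            r = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r / n
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0)  # soft threshold
    return beta

rng = np.random.default_rng(1)
n, k = 60, 4
X = rng.normal(size=(n, k))
X = (X - X.mean(0)) / X.std(0)               # standardize the columns
y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + rng.normal(0, 0.1, n)
y = y - y.mean()

# a crude LASSO path: count active variables for a decreasing penalty
path = {lam: np.count_nonzero(lasso_cd(X, y, lam)) for lam in (3.0, 0.5, 0.0)}
```

Each penalty value yields its own coefficient vector, i.e. its own set of forecasts, from which an acceptable solution can be chosen.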
In the case of ridge regression, there is a non-negative regularization parameter. If it is zero, ridge
regression reduces to the least squares method. The higher the parameter value, the closer the results are
to the mean value of the forecasted variable. The next example shows how the forecasts change for
different values of the regularization parameter (the forecasts are in ascending order). Forecasts
very close to those of the least squares method are marked in green; forecasts very close to the
average value of the dependent variable are marked in blue.
Figure 2. An example of ridge regression
As with LASSO, ridge regression produces several sets of forecasts.
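The shrinkage behaviour of ridge regression can be illustrated with its closed form on simulated data (a sketch with made-up data, not the project implementation): with the regularization parameter at zero the fit equals least squares, and as the parameter grows the forecasts collapse toward the mean of the dependent variable:

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge on centred data; alpha = 0 gives least squares."""
    Xc = X - X.mean(axis=0)
    beta = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]),
                           Xc.T @ (y - y.mean()))
    return y.mean() + Xc @ beta          # fitted values / forecasts

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.2, 40)

fit_ols = ridge_fit(X, y, 0.0)           # identical to the least squares fit
fit_big = ridge_fit(X, y, 1e9)           # shrinks toward the mean of y
```

Sweeping `alpha` over a grid between these two extremes yields the family of forecast sets described above.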
Annex 3 – Combining biased and unbiased data
Assume that X is an unbiased estimator of the parameter vector θ of length p (p > 2), with
expected value θ and variance σ², and that Y is a biased estimator of θ with expected value
θ + η and variance τ². The James-Stein estimator for the observation vectors x and y is of the form
x + ((p − 2)σ² / ‖y − x‖²) (y − x),
where ‖y − x‖ is the Euclidean norm.
The James-Stein estimator has a lower mean square error than the maximum likelihood estimator
when the number of estimated parameters exceeds two. This property also holds when the variance
σ² is not known a priori and is replaced by its estimator of the form
σ̂² = (1/n) Σᵢ₌₁ⁿ (yᵢ − ȳ)²,
where yᵢ are the elements of y.
If there are 𝑚 observations of the 𝐲 vector, the James-Stein estimator is given by the formula
x + ((p − 2)σ² / (m‖ȳ − x‖²)) (ȳ − x),
where ȳ = (ȳ₁, …, ȳₚ).
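A minimal sketch of the estimator (hypothetical numbers, not project data):

```python
import numpy as np

def james_stein(x, y_bar, sigma2, m=1):
    """Shrink the unbiased estimate x toward the biased one.

    x      : unbiased estimate of theta, length p > 2
    y_bar  : mean of the m biased observations
    sigma2 : variance of x (known or estimated)
    """
    p = len(x)
    d = y_bar - x
    return x + (p - 2) * sigma2 / (m * (d @ d)) * d

# hypothetical example: shrink x toward a biased estimate of zero
x = np.array([1.0, 2.0, 3.0, 4.0])
theta_js = james_stein(x, np.zeros(4), sigma2=2.0)
```

The shrinkage factor (p − 2)σ² / (m‖ȳ − x‖²) is data-determined: the farther the biased estimate lies from x, or the more precise x is, the less weight it receives.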
Annex 4 – Quality template for combining data from various sources
The following template is the result of work performed by Workpackage K – Methodology and quality.
Members of WPK adapted the structure of the Single Integrated Metadata Structure (SIMS)
template to the needs of big data sources. The definitions and guidelines were based on the ESS
handbook for quality and metadata reports (EHQMR).
The members of WPK analysed each subconcept of the EHQMR:
- New subconcepts were introduced when the existing ones did not cover all relevant aspects of
the quality of new data sources. These new subconcepts are indicated by an "A" for "additional"
in the subconcept number.
- Subconcepts not relevant for new data sources were deleted. The numbering of the
subconcepts of the EHQMR was retained, so the remaining subconcepts are not numbered
consecutively.
S.01 Contact
SIMS Concept name Answer
S.01 Contact Not relevant as matters discussed in this chapter are a general description of quality issues encountered throughout the project duration in all countries involved. Nevertheless, contact information to the leader of the project is provided below.
S.01.1 Contact organisation Statistical Office in Rzeszów, Marek Cierpiał-Wolan
S.01.6 Contact email address
M.Cierpial-Wolan@stat.gov.pl
S.01.7 Contact phone number
00 48 17 853 52 10 ext. 311
S.02 Metadata Update
SIMS Concept name Answer
S.02 Metadata update Not relevant for the pilot projects, since no Official Statistics is published.
S.03 Statistical Presentation
SIMS Concept name Answer
S.03 Statistical presentation
S.03.1 Data description As part of the project, the work in this task concentrated on identifying new tourist accommodation establishments
(not currently included in the statistical survey). The following variables were determined:
- name of the establishment,
- address details of the establishment (country, region, city, street, zip code),
- geolocation of the establishment (longitude and latitude),
- type of the new accommodation establishment (according to the statistical classification),
- capacity of the new accommodation establishment,
- months in which the establishment is operating.
The statistical output produced: Tourism Accommodation Statistics - population frame and its characteristics.
S.03.4
Statistical concepts and definitions
- New tourist accommodation establishments (not included in the survey frame of the tourism establishment base) were
obtained by web scraping accommodation portals and by comparison with the survey frame of tourist accommodation
establishments, using geolocation coordinates and address data.
- Geolocation of all establishments was obtained using the HERE Maps API to extract geolocation coordinates
(longitude and latitude) of establishments from the survey and from accommodation portals. Alternatively, other internally
available geocoding tools were used.
- Types of accommodation establishments and their numbers of bed places showed discrepancies, as booking portals have
their own classification of accommodation types, which does not follow the one used in official statistics. This
information was corrected with machine learning methods.
- Months in which an establishment is operating were obtained with the rule "if the offer of a given establishment is present
during a month, then the establishment is operating"; this value was also calculated with machine learning.
- Names of accommodation establishments may mention mixed forms of accommodation (hotel/B&B,
resort/camping site); hence, when the accommodation representative registers the establishment
on a (travel agency or aggregator) platform, this field is often left empty. This holds, in particular, for large
accommodation companies and multinational hospitality businesses.
S.3.5 Statistical unit The unit for this workpackage was a domestic accommodation establishment. The survey frame of tourist accommodation
establishments is obtained using the NACE Rev. 2 activity codes. This does not cover all units operating in a
given country, as the attribution of a NACE code depends on the business structure of enterprises. For very
small establishments, the scraped data sometimes refer to a particular person or household. In contrast to the traditional
approach, these households/persons may not be surveyed (they are not in the Business Register, or their turnover or number
of employees is below a specific threshold). For the purposes of accommodation supply, however, they are
economically active, and thanks to web scraping of booking portals their activities can be included in the data.
S.3.6 Statistical population Web-scraped data covered all tourist accommodation establishments that were active on travel and aggregator
platforms. WPJ defined the population as accommodation establishments advertising on Hotels.com and Booking.com.
The statistical population comprised new domestic accommodation establishments, with complementary information on
their types, regions and other characteristics allowing their identification.
S.3.7 Reference area Data referred to the countries taking part in the project and to the tourist accommodation establishments
offered by accommodation portals in a given country. The web scraper of Hotels.com had the proxy "destination" set
differently in each country: some partners were able to use just the country's name as the "destination", some
had to specify provinces and cities, and some followed the NUTS classification. Commonly used aggregates for the
geographical breakdown in each country were applied. The choice of destination depended on the size of the
country, its touristic attractiveness and other country-specific issues.
S.3.8 Time coverage Web scraping of the portals began in July 2019. Most of the project partners have been scraping data daily
and have built a substantial time series. Exceptions occurred when a web scraper was temporarily
blocked or the structure of a website changed and the web scraping code needed adjustments.
S.04 Unit of Measure
SIMS Concept name Answer
S.04 Unit of measure Hotels.com – unit of measure
- Date - day, month and year in which an accommodation establishment is detected
- scrapingType - date configuration (next day, next weekend or last Thursday of the month) used to gather accommodation prices
- destinationName - geographical area connected to the destination_id defined by the platform and chosen by a visitor using the search engine of the platform
- offerId - platform id assigned to a specific visitor's search for accommodation (demand side)
- hotelId - platform id (client number) assigned to a specific accommodation establishment (supply side)
- hotelName - name of the accommodation establishment in the platform database
- accType - type of accommodation establishment registered by the accommodation responsible to classify the accommodation business on the platform (e.g. when searching accommodations on the platform by this "type")
- locality - city where an accommodation establishment is located or established
- postalCode - combination of numbers and letters in addresses that denotes the neighbourhood and street of an establishment
- latitude and longitude - coordinates, typically provided in the WGS84 Coordinate Reference System (CRS)
- street - address, i.e. street name and house number of an accommodation establishment
- guestReviewsTotal - cumulative number of guest reviews
Booking.com – unit of measure
- Date - day, month and year in which an accommodation establishment is detected
- Accom_id - platform id (client number) assigned to a specific accommodation establishment (supply side)
- Name_accom - name of the accommodation establishment in the platform database
- Name_long - full name of the accommodation establishment in the platform database; it often also contains the type of accommodation
- Type_accom - type of accommodation registered by the accommodation responsible to classify the accommodation business on the platform
- Address_accom - city where an accommodation establishment is located or established
- Location - city where an accommodation establishment is located or established
- Postal_code - combination of numbers and letters in addresses that denotes the neighbourhood and street of an establishment
- Longitude and Latitude - typically provided in WGS84 coordinates
- Street name - address, i.e. street name
- House number - house number of an accommodation establishment
- Numberl_reviews - cumulative number of guest reviews
S.05 Reference Period
SIMS Concept name Answer
S.05 Reference period Day, month and year in which an accommodation establishment is detected - the moment it first occurs on the booking website, and subsequently each time the accommodation establishment was scraped.
S.06 Institutional Mandate
SIMS Concept name Answer
S.06 Institutional mandate
S.06.1 Legal acts and other agreements
There is no EU legal mandate regulating web scraping of websites. Each partner country has its own rules on using the web scraping method to gather data. Please see more in the use case dedicated to legal aspects.
S.06. A Data access and data transmission
Access to the raw data is based on public online availability. A few changes to the endpoint APIs were implemented in the web scraper code, and the new code was sent to the partners as soon as possible.
S.07 Confidentiality
SIMS Concept name Answer
S.07 Confidentiality
S.07.1 Confidentiality – policy
Not relevant for web-scraped data, since they are publicly available. Data gathered by web scraping is generally available, i.e. not considered confidential.
S.07.2 Confidentiality - data treatment
The data gathered by web scraping methods were not subject to confidentiality, as they were publicly available and scraped from worldwide-available websites. Regulation 223/2009 on European statistics, Article 25 (Data from public sources): "Data obtained from sources lawfully available to the public and which remain available to the public according to national legislation shall not be considered confidential for the purpose of dissemination of statistics obtained from those data." Each country, if needed, applied its own rules of statistical confidentiality.
S.07.A1 Privacy Web scraped data may contain highly privacy-sensitive data of very small accommodations, i.e. natural persons (not enterprises) providing personal details. For these reasons, attention was paid to compliance based on confidentiality rules explained in S.07.2.
S.08 Release Policy
SIMS Concept name Answer
S.08 Release policy
S.08.A Release policy for Experimental Statistics
Not applicable for pilot project.
S.09 Frequency of Dissemination
SIMS Concept name Answer
S.09 Frequency of dissemination
Not applicable for pilot project.
S.10 Accessibility and Clarity
SIMS Concept name Answer
S.10 Accessibility and clarity
S.10.6 Documentation on methodology
- Methodological manual for tourism statistics
- J1 Deliverable of WPJ for the ESSnet on Big Data II: ESSnet Methods for web scraping, data processing and analyses
- J2 Deliverable of WPJ for the ESSnet on Big Data II: Interim technical report showing the preliminary results and a general description of the methods used
- J3 Deliverable of WPJ for the ESSnet on Big Data II: Methodological Framework Report
S.10.7 Quality documentation
- J1 Deliverable of WPJ for the ESSnet on Big Data II: ESSnet Methods for web scraping, data processing and analyses
- J2 Deliverable of WPJ for the ESSnet on Big Data II: Interim technical report showing the preliminary results and a general description of the methods used
- J3 Deliverable of WPJ for the ESSnet on Big Data II: Methodological Framework Report
- K6 Deliverable of WPK for the ESSnet on Big Data II: Quality report template - final
- K3 Deliverable of WPK for the ESSnet on Big Data II: Revised Version of the Quality Guidelines for the Acquisition and Usage of Big Data - Final version
S.11 Quality Management
SIMS Concept name Answer
S.11 Quality management
S.11.1 Quality assurance Statistics Netherlands has been certified according to ISO 9001 since 2018. This means that quality management of
methodology and process development for official statistics has been adopted, audited and assessed. The certification
means that Statistics Netherlands focuses on:
- the quality procedures for internal and external reports, recommendations and briefs;
- the quality assurance of statistical development projects in which methodologists and business analysts participate;
- the quality assurance of methodological courses taught to statisticians;
- the internal management of the department.
The second and third points, in particular, provide the grounds to ensure that the middle- and long-term goals of innovative projects such as WPJ on Tourism Statistics were met.
S.11.2 Quality assessment Each country assessed the quality of the results by themselves based on the rules and criteria applied in their country.
S.12 Relevance
SIMS Concept name Answer
S.12 Relevance
S.12.1 User needs Tourism statisticians, touristic organizations, ministries and other public users.
S.12.3 Completeness Not applicable for pilot project.
S.12.A Added Value through new data source
The identification of tourist accommodation establishments not covered by the survey makes it possible to improve (expand)
the survey frame of tourist accommodation establishments, provides more complete information about establishments
operating in the country, and thus improves the quality of the results obtained.
The use of web scraping techniques, and in particular of platforms such as Hotels.com and Booking.com, accelerates the inventory process of tourist accommodations. The chosen platforms provide up-to-date inside information on the population dynamics of hospitality businesses and households.
S.13 Accuracy and Reliability
SIMS Concept name Answer
S.13 Accuracy and reliability
S.13.1 Overall accuracy Each country assessed the quality of the results itself, based on the rules and criteria applied in that country.
S.13.2 Sampling error Not applicable for pilot project.
S.13.3 Non-sampling error The web scraping data do not cover NACE 55.3 (camping sites). In general, there are websites and platforms specialising in and targeting this area of touristic activity; these sites were not part of the pilot project.
Unit non-response rate Please see the chapter dedicated to use cases for more description.
Item non-response rate Not applicable for this pilot project, as the units were not responding to a survey.
S.13.3.1 Coverage error Undercoverage error occurred, as only accommodation establishments available on the booking portals
were gathered. The percentage of undercoverage cannot be calculated, due to the lack of data on how many
establishments were neither present on the chosen booking portals nor covered by the survey on tourist
accommodation statistics.
Overcoverage error in the form of duplicates occurred when the same accommodation establishment was included on two or more booking portals. This was treated with geolocation data; please see details in the chapter "Use cases". Overcoverage may also occur when accommodation establishments that do not belong to the defined target population advertise on the platforms. This overcoverage is hard to detect with information from the platform only.
S.13.3.1.1 Overcoverage – rate Not applicable to this pilot project. Web scraped websites only offer units active in the business of accommodation renting.
S.13.3.2 Measurement error The main measurement errors were:
- Scraper features: changes to the location_id or street_addresses on the platforms caused the scraper to fail and stop. The web scraper code was corrected in such instances.
- Target population: the proximity filter of the platform (Hotels.com) caused accommodations outside the target population to be included in the scraping files, i.e. web-scraped listings contained foreign accommodations, namely those close to the borders of either a city or a country. In such cases, regular expressions detecting patterns in postcodes or language-specific characters were used, and geolocation was used to determine whether an accommodation belongs to a specific geographic area.
- Chained-brand Hotels & Resorts: rapid ownership changes in the leisure and recreation industry required more flexible and adaptive approaches to detect and correct for mergers, acquisitions and franchises connected to multinationals such as Holiday Inn (UK), Best Western or Hilton (US), NH Hotel Group (ES), Jinjiang International (CN), Scandic Hotels (SE) and Accor (FR). No solution was found to this problem, as there is very little information about platform policies for keeping the registry of accommodations up to date.
S.13.3.3 Nonresponse error The main issue was listings of companies with the name and address of the accommodation missing, sometimes even on purpose. Some listings, instead of providing the name of the accommodation, used this field as a sort of description, such as "Studio on a houseboat, near city centre!" or "Spacious, modern family home on the canal with parking". The strategy was to set such accommodations aside in a sort of "quarantine" group until new data were collected, analysed and assessed.
S.13.3.4 Processing error The method applied to connect establishments from the population frame to the web-scraped data used both probabilistic matching (based on latitude and longitude coordinates, or distance, and accommodation names) and deterministic matching (based on names and address data), which allowed the occurrence of processing errors to be minimised.
S.13.3.5 Model assumption error
Not applicable for pilot project.
S.14 Timeliness and Punctuality
SIMS Concept name Answer
S.14 Timeliness and punctuality
S.14.1 Timeliness Web scraping allowed adding new units on a regular basis. Flash estimates in terms of accommodation establishments and their occupancy were also calculated at t+1.
S.15 Coherence and Comparability
SIMS Concept name Answer
S.15 Coherence and Comparability
S.15.1 Comparability - geographical
The data gathered are comparable between geographical areas. The exception is areas close to borders, where listings from foreign accommodations may be wrongly assigned. Please see more explanation in S.13.3.2.
S.15.2 Comparability - over time
If no legal problems occur and the scraped websites do not disappear, there will be no problems with comparability over time.
S.15.3 Coherence- cross domain
Not applicable for this pilot project.
S.15.4 Coherence – internal
S.15.A.1 Coherence - with existing information/ Official Statistics
The types of accommodation establishment used on booking portals often differ from the classification used in official statistics. Machine learning methods were applied to assign the proper type.
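The report does not specify which machine learning method was used, so the following is only a minimal illustration of the underlying mapping problem: a lookup table from portal labels to official categories, with a keyword fallback for unseen labels. All labels and category names below are hypothetical examples.

```r
# Hypothetical mapping from portal labels to the official classification
portal_to_official <- c(
  "aparthotel"  = "Hotels and similar accommodation",
  "guest house" = "Holiday and other short-stay accommodation",
  "campsite"    = "Camping grounds"
)

assign_type <- function(portal_label) {
  key <- tolower(trimws(portal_label))
  if (key %in% names(portal_to_official)) return(portal_to_official[[key]])
  if (grepl("hotel", key)) return("Hotels and similar accommodation")
  NA_character_  # unresolved: left for the model or for manual review
}

assign_type("Aparthotel")  # "Hotels and similar accommodation"
```

A trained classifier would replace the keyword fallback, but the structure stays the same: known labels are mapped directly and only the residual cases need model-based assignment.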
S.15.A.2 Comparability - between information from several distinct new data sources
The degree of comparability between data sources depends on how well the establishments from the different sources were matched. When matched correctly, the data describe the same establishment.
S.16 Cost and Burden
SIMS Concept name Answer
S.16 Cost and burden
S.16.A Potential savings in cost and burden
The direct costs for the statistical office decrease, since less manual work is needed to collect data on accommodation establishments.
The indirect costs of keeping the crawlers updated and linking the data are higher, as robust editing processes are required, and extra costs can even arise in other departments.
Short-term costs are connected to developing the web crawlers.
Medium-term costs cover testing the crawlers, deploying them on a dedicated server and training the operators.
Long-term costs cover the maintenance, updating and operation of the servers, the crawlers and the trained operators.
S.17 Data Revision
SIMS Concept name Answer
S.17 Data revision Not applicable.
S.17.1 Data revision – policy
S.18 Statistical Processing
SIMS Concept name Answer
S.18 Statistical processing
S.18.1 Source data A combination of two data sources was used: 1. the survey on tourist accommodation establishments; 2. web scraping of accommodation portals.
These sources were combined using the geolocation, address and name of the establishment; new accommodation establishments were identified and included in the survey frame.
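The frame-enrichment step can be sketched with the `dplyr` package already used in the annex script. The data frames `frame` and `scraped` and their `name` and `address` columns are assumed to exist and to be harmonised beforehand; this is an illustration, not the project's actual code.

```r
library(dplyr)

# Scraped units with no counterpart in the frame are candidate new units
new_units <- anti_join(scraped, frame, by = c("name", "address"))

# Add them to the survey frame (in practice only after review)
enriched_frame <- bind_rows(frame, new_units)
```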
S.18.2 Frequency of data acquisition and recording
The frequency of web scraping depends on the purpose: for enriching the survey frame, monthly updates (or even updates only every second or third month) are sufficient, but for the flash estimates more frequent scraping is necessary (in order to capture short-term price changes).
S.18.3 Data collection
S.18.4 Data validation
S.18.5 Data compilation
S.18.5.1 Imputation – rate
S.19 Comment
SIMS Concept name Answer
S.19 Comment
Annex 5 – R script to visNetwork object WPJ.2.PT
##========================================================##
## ##
## Network Visualization with R ##
## ESSnet Big Data II WPJ Use Case WPJ.04.PT ##
## ##
## Rui Alves ##
## Statistics Portugal ##
## Email: rui.alves@ine.pt ##
## ##
##========================================================##
# Load Packages -----------------------------------------------------------
# Packages should be installed and loaded before running the script
library(dplyr) # works with dplyr version 1.0.0 and R-4.0.2
library(visNetwork) # works with visNetwork version 2.0.9 and R-4.0.2
library(rstudioapi) # make sure you have it installed
# Create set_wd function --------------------------------------------------
set_wd <- function() {
# library(rstudioapi) # make sure you have it installed
current_path <- getActiveDocumentContext()$path
setwd(dirname(current_path))
print(getwd())
}
set_wd()
# ~~2. Run Code to create Dataframes ---------------------------------------------------------------
# ~~2.1 Create Nodes ---------------------------------------------------------------
nodes <- structure(list(id = c(2000230, 1000110, 7000110, 6000630, 6000640,
6000610, 2e+06, 2000220, 2000210, 3e+06, 2000223, 2000222, 2000211,
1, 3000330, 9000001, 9e+06, 8e+06, 8000001, 8000002, 8000003,
6000999), label = c("Credit & Debit Card transactions", "hotels",
"Hotels.com", "Improve quality of satellite accounts", "Improve quality of tourists expenses data",
"Improve tourist accommodation base", "Multi-Purpose Data", "NRT Local Accommodation",
"NTR Tourist Establishments", "Survey Data", "travelBI by Turismo de Portugal",
"Turismo de Portugal Local Accommodation Open Data", "Turismo de Portugal Tourist Establishments Open Data",
"Web Data", "Survey on tourist accommodation base", "Geolocation Tool (HERE Maps API)",
"Tools", "Variables", "Name, Address, PostalCode", "Name, Address, PostalCode",
"Coordinates", "Combining Data"), title = c("id:2000230 <br> <br> Credit & Debit <br> card transactions",
"id:1000110 <br> <br> Hotels.com", "id:7000110 <br> <br> https://www.Hotels.com",
"id:6000630 <br> <br> Improve quality <br> of satellite", "id:6000640 <br> <br> Improve quality <br> of tourists expenses <br> data",
"id:6000610 <br> <br> Improve tourist <br> accommodation <br> base of reference",
"id:2000000 <br> <br> Multi-Purpose", "id:2000220 <br> <br> National Tourist Registration <br> Local Accommodation",
"id:2000210 <br> <br> National Tourist Registration <br>Tourist Establishments",
"id:3000000 <br> <br> Survey <br> Data", "id:2000223 <br> <br> Turismo de Portugal: <br> TravelBI <br>
https://travelbi.turismodeportugal.pt/en-us/Pages/Home.aspx",
"id:2000222 <br> <br> Turismo de Portugal Open Data: <br>Local Accommodation <br>
https://dadosabertos.turismodeportugal.pt/datasets/estabelecimentos-de-al",
"id:2000211 <br> <br> Turismo de Portugal Open Data: <br> Tourist Establishment <br>
https://dadosabertos.turismodeportugal.pt/datasets/empreendimentos-turisticos-existentes?geometry=-91.485%2C33.243%2C75.771%2C54.949",
"id:1 <br> <br> Web <br> Scraped", "id:3000330 <br> <br> Survey on <br> tourist <br> accommodation base",
"id:9000001 <br> Geolocation Tool <br> (HERE Maps API)", "id:9000000 <br> Tools",
"id:8000000 <br> Variables", "id:8000001 <br> Survey Data <br> Name of accommodations <br> Address <br> Postal Code",
"id:8000002 <br> Hotels.com Data <br> Name of accommodations <br> Address <br> Postal Code",
"id:8000003 <br> Latitue <br> Longitude", "id6000999 <br> Use Case WPJ.03.PT"
), group = c("Country: PT, Tourism Domain: Demand, Data Source: Multi-Purpose Data",
"Tourism Domain: Supply, Data Source: Web Scraped Data, Tourism Domain: Accommodation Base, Tourism Domain: Expenses, Variables",
"Tourism Domain: Supply, Data Source: Web Scraped Data, Tourism Domain: Accommodation Base, Tourism Domain: Expenses, External Links",
"Tourism Domain: Demand, Experimental Results", "Tourism Domain: Demand, Experimental Results",
"Tourism Domain: Supply, Experimental Results, Variables", "Data Source, Data Source: Multi-Purpose Data",
"Tourism Domain: Demand, Data Source: Multi-Purpose Data", "Tourism Domain: Demand, Data Source: Multi-Purpose Data",
"Data Source, Data Source: Survey Data", "Tourism Domain: Demand, Data Source: Multi-Purpose Data, External Links",
"Tourism Domain: Demand, Data Source: Multi-Purpose Data, External Links",
"Tourism Domain: Demand, Data Source: Multi-Purpose Data, External Links",
"Data Source, Data Source: Web Scraped Data", "Tourism Domain: Supply, Data Source: Survey Data, Variables",
"Tools, Variables", "Tools, Variables", "Variables", "Variables, Data Source, Data Source: Survey Data",
"Variables, Data Source, Data Source: Web Scraped Data", "Variables",
"Experimental Results, Variables"), value = c(4, 4, 2, 4, 4,
4, 6, 4, 4, 6, 2, 2, 2, 6, 4, 4, 6, 6, 4, 4, 4, 4), shape = c("dot",
"dot", "triangle", "square", "square", "square", "dot", "dot",
"dot", "dot", "triangle", "triangle", "triangle", "dot", "dot",
"diamond", "diamond", "dot", "dot", "dot", "dot", "square"),
color = c("orange", "orange", "lightblue", "purple", "purple",
"purple", "teal", "orange", "orange", "teal", "lightblue",
"lightblue", "lightblue", "teal", "orange", "darkgrey", "darkgrey",
"darkred", "darkred", "darkred", "darkred", "purple"), url = c(NA,
NA, "https://www.Hotels.com", NA, NA, NA, NA, NA, NA, NA,
"https://travelbi.turismodeportugal.pt/en-us/Pages/Home.aspx",
"https://dadosabertos.turismodeportugal.pt/datasets/estabelecimentos-de-al",
"https://dadosabertos.turismodeportugal.pt/datasets/empreendimentos-turisticos-existentes?geometry=-91.485%2C33.243%2C75.771%2C54.949",
NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA, 22L
), class = "data.frame")
# ~~2.2 Create Edges ---------------------------------------------------------------
edges <- structure(list(from = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1000110, 1000110, 1000110, 1000110, 1000110, 1000120, 1000120,
1000120, 1000120, 1000130, 1000130, 1000140, 1000140, 1000140,
1000150, 1000150, 1000150, 1000150, 1000160, 1000160, 1000160,
1000170, 1000180, 1000190, 1000201, 1000202, 1000203, 1000204,
2e+06, 2e+06, 2e+06, 2e+06, 2e+06, 2e+06, 2e+06, 2e+06, 2e+06,
2e+06, 2e+06, 2e+06, 2e+06, 2e+06, 2e+06, 2e+06, 2e+06, 2e+06,
2e+06, 2e+06, 2e+06, 2e+06, 2e+06, 2000210, 2000210, 2000210,
2000220, 2000220, 2000220, 2000230, 2000230, 2000240, 2000240,
2000250, 2000260, 2000270, 2000270, 2000280, 2000280, 2000290,
2000290, 2000290, 2000299, 2000299, 2000301, 2000302, 2000302,
2000303, 2000304, 2000304, 2000304, 2000305, 3e+06, 3e+06, 3e+06,
3e+06, 3e+06, 3e+06, 3e+06, 3e+06, 3e+06, 3e+06, 3e+06, 3e+06,
3000310, 3000310, 3000320, 3000320, 3000330, 3000330, 3000330,
3000330, 3000330, 3000350, 3000360, 6000610, 6000610, 6000620,
6000620, 6000620, 6000630, 6000630, 6000640, 6000640, 6000640,
6000650, 6000650, 6000660, 6000701, 6000702, 1000190, 1000190,
1000110, 1000110, 1000120, 1000120, 1000120, 1000140, 1000140,
1000140, 2000297, 2000297, 2000297, 2000297, 2000297, 2000298,
2000298, 2000298, 2000298, 2000298, 3000370, 3000370, 3000370,
3000370, 3000370, 3000310, 3000310, 3000310, 3000330, 3000380,
3000380, 3000391, 3000391, 3000391, 3000392, 3000392, 3000392,
2000297, 1000130, 1000140, 1000201, 1000202, 1000203, 1000203,
1000204, 1000204, 2002001, 2002001, 2002001, 2002001, 2000295,
2000295, 2000295, 2000295, 1000202, 2000293, 2000293, 2000293,
2000292, 2000294, 2000294, 2000294, 3000340, 3000340, 1000170,
1000170, 1000170, 1000191, 1000191, 1000191, 1000191, 1000191,
1000191, 1000191, 2000292, 1000180, 3000310, 9e+06, 8e+06, 8e+06,
8000001, 8e+06, 8000003, 8000001, 8000002, 9000001, 3000330,
1000110, 8000001, 8000002, 8000003, 6000999), from_label = c("Web Data",
"Web Data", "Web Data", "Web Data", "Web Data", "Web Data", "Web Data",
"Web Data", "Web Data", "Web Data", "Web Data", "Web Data", "Web Data",
"Hotels.com", "Hotels.com", "Hotels.com", "Hotels.com", "Hotels.com",
"Booking.com", "Booking.com", "Booking.com", "Booking.com", "Airbnb.com",
"Airbnb.com", "Tripadvisor.com", "Tripadvisor.com", "Tripadvisor.com",
"Skyscanner.net", "Skyscanner.net", "Skyscanner.net", "Skyscanner.net",
"taxi2airport.com", "taxi2airport.com", "taxi2airport.com", "365tickets.com",
"seatguru.com", "Pochivka.bg", "Other webpages (booking accommodation)",
"hrs.com", "holydaycheck", "pincamp", "Multi-Purpose Data", "Multi-Purpose Data",
"Multi-Purpose Data", "Multi-Purpose Data", "Multi-Purpose Data",
"Multi-Purpose Data", "Multi-Purpose Data", "Multi-Purpose Data",
"Multi-Purpose Data", "Multi-Purpose Data", "Multi-Purpose Data",
"Multi-Purpose Data", "Multi-Purpose Data", "Multi-Purpose Data",
"Multi-Purpose Data", "Multi-Purpose Data", "Multi-Purpose Data",
"Multi-Purpose Data", "Multi-Purpose Data", "Multi-Purpose Data",
"Multi-Purpose Data", "Multi-Purpose Data", "Multi-Purpose Data",
"NTR Tourist Establishments", "NTR Tourist Establishments", "NTR Tourist Establishments",
"NRT Local Accommodation", "NRT Local Accommodation", "NRT Local Accommodation",
"Credit & Debit Card transactions", "Credit & Debit Card transactions",
"Airport Data", "Airport Data", "Tax Authority Register of Short-Term Lease Properties",
"Register of Properties offered for short-term lease through digital platforms",
"Register of non-categorised accommodation", "Register of non-categorised accommodation",
"Register of Addresses and Buildings", "Register of Addresses and Buildings",
"NBTC-NIPO", "NBTC-NIPO", "NBTC-NIPO", "Other sources", "Other sources",
"Monthly bed tax data inc. overnights", "Financial transaction data",
"Financial transaction data", "Regulatory reporting FIN1-12",
"BTB data", "BTB data", "BTB data", "Register of accommodation establishments",
"Survey Data", "Survey Data", "Survey Data", "Survey Data", "Survey Data",
"Survey Data", "Survey Data", "Survey Data", "Survey Data", "Survey Data",
"Survey Data", "Survey Data", "Survey on participation of residents in trips",
"Survey on participation of residents in trips", "Survey on trips made by foreigners",
"Survey on trips made by foreigners", "Survey on tourist accommodation base",
"Survey on tourist accommodation base", "Survey on tourist accommodation base",
"Survey on tourist accommodation base", "Survey on tourist accommodation base",
"Survey on the visits by foreigners", "Survey data on tourist trips",
"Improve tourist accommodation base", "Improve tourist accommodation base",
"Spatial disaggregation of accommodation", "Spatial disaggregation of accommodation",
"Spatial disaggregation of accommodation", "Improve quality of satellite accounts",
"Improve quality of satellite accounts", "Improve quality of tourists expenses data",
"Improve quality of tourists expenses data", "Improve quality of tourists expenses data",
"Flash estimates", "Flash estimates", "Improve quality data on trips",
"Update register of accommodation establishments", "Improve quality of tourists transport data",
"Pochivka.bg", "Pochivka.bg", "Hotels.com", "Hotels.com", "Booking.com",
"Booking.com", "Booking.com", "Tripadvisor.com", "Tripadvisor.com",
"Tripadvisor.com", "Border traffic survey (BI)", "Border traffic survey (BI)",
"Border traffic survey (BI)", "Border traffic survey (BI)", "Border traffic survey (BI)",
"Waste production (ISPRA)", "Waste production (ISPRA)", "Waste production (ISPRA)",
"Waste production (ISPRA)", "Waste production (ISPRA)", "Water Consumption",
"Water Consumption", "Water Consumption", "Water Consumption",
"Water Consumption", "Survey on participation of residents in trips",
"Survey on participation of residents in trips", "Survey on participation of residents in trips",
"Survey on tourist accommodation base", "Survey on rural tourism accommodations",
"Survey on rural tourism accommodations", "Railway, airport and port data",
"Railway, airport and port data", "Railway, airport and port data",
"Survey on museums and similar", "Survey on museums and similar",
"Survey on museums and similar", "Border traffic survey (BI)",
"Airbnb.com", "Tripadvisor.com", "Other webpages (booking accommodation)",
"hrs.com", "holydaycheck", "holydaycheck", "pincamp", "pincamp",
"Mobile phone data", "Mobile phone data", "Mobile phone data",
"Mobile phone data", "Water Demand", "Water Demand", "Water Demand",
"Water Demand", "hrs.com", "Smart City", "Smart City", "Smart City",
"Land Border Traffic", "Parking Meters Data", "Parking Meters Data",
"Parking Meters Data", "Border Traffic Survey", "Border Traffic Survey",
"365tickets", "365tickets", "365tickets", "nocowanie", "nocowanie",
"nocowanie", "nocowanie", "nocowanie", "nocowanie", "nocowanie",
"Land Border Traffic", "seatguru.com", "Survey on participation of residents in trips",
"Tools", "Variables", "Variables", "Name, Address, PostalCode",
"Variables", "Coordinates", "Name, Address, PostalCode", "Name, Address, PostalCode",
"Geolocation Tool (HERE Maps API)", "Survey on tourist accommodation base",
"hotels", "Name, Address, PostalCode", "Name, Address, PostalCode",
"Coordinates", "Combining Data"), to = c(1000110, 1000120, 1000130,
1000140, 1000150, 1000160, 1000170, 1000180, 1000190, 1000201,
1000202, 1000203, 1000204, 2000305, 6000610, 6000620, 6000650,
7000110, 2000305, 6000610, 6000620, 7000120, 6000610, 7000130,
6000630, 6000640, 7000140, 3000310, 3000320, 6000630, 7000150,
6000630, 6000640, 7000160, 7000170, 7000180, 7000190, 2000305,
7000191, 7000192, 7000193, 2000210, 2000220, 2000230, 2000240,
2000250, 2000260, 2000270, 2000280, 2000290, 2000292, 2000293,
2000294, 2000295, 2000296, 2000297, 2000298, 2000299, 2000301,
2000302, 2000303, 2000304, 2000305, 2002001, 2000211, 2000223,
6000610, 2000222, 2000223, 6000610, 6000630, 6000640, 6000630,
6000640, 6000610, 6000610, 2000271, 6000610, 2000281, 6000610,
2000291, 6000630, 6000640, 6000630, 6000640, 3000330, 6000640,
6000702, 3000330, 3000310, 6000673, 6000674, 6000701, 2002991,
2002992, 3000310, 3000320, 3000330, 3000340, 3000350, 3000360,
3000370, 3000380, 3000391, 3000392, 6000630, 6000640, 6000630,
6000640, 6000610, 6000620, 6000640, 6000650, 6000701, 6000660,
6000660, 6000611, 6000612, 6000611, 6000612, 6000630, 6000611,
6000612, 6000611, 6000612, 6000660, 6000611, 6000630, 6000630,
6000630, 6000660, 6000610, 6000650, 6000630, 6000640, 6000630,
6000640, 6000650, 6000610, 6000620, 6000650, 6000630, 6000640,
6000672, 6000671, 6000660, 6000660, 6000674, 6000671, 6000672,
6000673, 6000660, 6000674, 6000671, 6000672, 6000673, 6000660,
6000671, 6000674, 6000630, 6000610, 6000620, 6000660, 6000672,
6000674, 6000674, 6000671, 6000673, 6000674, 2000305, 2000305,
6000610, 6000610, 2000305, 6000610, 2000305, 6000610, 6000674,
6000671, 6000673, 6000672, 6000674, 6000671, 6000672, 6000673,
2000305, 6000671, 6000672, 6000673, 6000660, 6000671, 6000672,
6000673, 6000640, 6000660, 6000671, 6000672, 6000673, 6000610,
6000620, 2000305, 6000650, 6000640, 6000630, 7000194, 6000640,
6000660, 6000671, 9000001, 8000002, 8000001, 9000001, 8000003,
1000110, 3000330, 9000001, 8000003, 8000001, 8000002, 6000999,
6000999, 6000999, 6000610), to_label = c("Hotels.com", "Booking.com",
"Airbnb.com", "Tripadvisor.com", "Skyscanner.net", "taxi2airport.com",
"365tickets.com", "seatguru.com", "Pochivka.bg", "Other webpages (booking accommodation)",
"hrs.com", "holydaycheck", "pincamp", "Register of accommodation establishments",
"Improve tourist accommodation base", "Spatial disaggregation of data on tourist accommodation base",
"Flash estimates of the use of tourist accommodation base", "Hotels.com",
"Register of accommodation establishments", "Improve tourist accommodation base",
"Spatial disaggregation of data on tourist accommodation base",
"Booking.com", "Improve tourist accommodation base", "Airbnb.com",
"Improve quality of satellite accounts", "Improve quality of tourists expenses data",
"Tripadvisor.com", "Survey on participation of residents in trips",
"Survey on trips made by foreigners", "Improve quality of satellite accounts",
"Skyscanner.net", "Improve quality of satellite accounts", "Improve quality of tourists expenses data",
"taxi2airport.com", "365tickets.com", "seatguru.com", "Pochivka.bg",
"Register of accommodation establishments", "hrs.com", "holydaycheck",
"pincamp", "NTR Tourist Establishments", "NRT Local Accommodation",
"Credit & Debit Card transactions", "Airport Data", "Register of non-categorised tourist accommodation establishments",
"Tax Authority Register of Short-Term Lease Properties", "Register of Properties offered for short-term lease through digital
platforms",
"Register of Addresses and Buildings", "NBTC-NIPO", "Land Border Traffic",
"Smart City", "Parking Meters Data", "Water Demand", "Ministry of Interior",
"Border traffic survey (BI)", "Waste production (ISPRA)", "Other sources",
"Monthly bed tax data inc. overnights", "Financial transaction data",
"Regulatory reporting FIN1-12", "BTB data", "Register of accommodation establishments",
"Mobile phone data", "Turismo de Portugal Open Data", "travelBI by Turismo de Portugal",
"Improve tourist accommodation base", "Turismo de Portugal Local Accommodation Open Data",
"travelBI by Turismo de Portugal", "Improve tourist accommodation base",
"Improve quality of satellite accounts", "Improve quality of tourists expenses data",
"Improve quality of satellite accounts", "Improve quality of tourists expenses data",
"Improve tourist accommodation base", "Improve tourist accommodation base",
"Netherlands Chamber of Commerce", "Improve tourist accommodation base",
"BAG Viewer", "Improve tourist accommodation base", "NBTC Dashboard",
"Improve quality of satellite accounts", "Improve quality of tourists expenses data",
"Improve quality of satellite accounts", "Improve quality of tourists expenses data",
"Survey on tourist accommodation base", "Improve quality of tourists expenses data",
"Improve quality of tourists transport data", "Survey on tourist accommodation base",
"Survey on participation of residents in trips", "Tourism potential",
"Tourism attractiveness", "Update register of accommodation establishments",
"Railway, airport and port data", "Survey on museums and similar institutions",
"Survey on participation of residents in trips", "Survey on trips made by foreigners",
"Survey on tourist accommodation base", "Border Traffic Survey",
"Survey on the visits by foreigners", "Survey data on tourist trips",
"Water Consumption", "Survey on rural tourism accommodations",
"Railway, airport and port data", "Survey on museums and similar",
"Improve quality of satellite accounts", "Improve quality of tourists expenses data",
"Improve quality of satellite accounts", "Improve quality of tourists expenses data",
"Improve tourist accommodation base", "Spatial disaggregation of accommodation",
"Improve quality of tourists expenses data", "Flash estimates",
"Update register of accommodation establishments", "Improve quality data on trips",
"Improve quality data on trips", "Eurostat Experimental Statistics",
"WPJ Milestones and Deliverables", "Eurostat Experimental Statistics",
"WPJ Milestones and Deliverables", "Improve quality of satellite accounts",
"Eurostat Experimental Statistics", "WPJ Milestones and Deliverables",
"Eurostat Experimental Statistics", "WPJ Milestones and Deliverables",
"Improve quality data on trips", "Eurostat Experimental Statistics",
"Improve quality of satellite accounts", "Improve quality of satellite accounts",
"Improve quality of satellite accounts", "Improve quality data on trips",
"Improve tourist accommodation base", "Flash estimates", "Improve quality of satellite accounts",
"Improve quality of tourists expenses data", "Improve quality of satellite accounts",
"Improve quality of tourists expenses data", "Flash estimates",
"Improve tourist accommodation base", "Spatial disaggregation of accommodation",
"Flash estimates", "Improve quality of satellite accounts", "Improve quality of tourists expenses data",
"Event related tourism", "City Tourism", "Improve quality data on trips",
"Improve quality data on trips", "Tourism attractiveness", "City Tourism",
"Event related tourism", "Tourism potential", "Improve quality data on trips",
"Tourism attractiveness", "City Tourism", "Event related tourism",
"Tourism potential", "Improve quality data on trips", "City Tourism",
"Tourism attractiveness", "Improve quality of satellite accounts",
"Improve tourist accommodation base", "Spatial disaggregation of accommodation",
"Improve quality data on trips", "Event related tourism", "Tourism attractiveness",
"Tourism attractiveness", "City Tourism", "Tourism potential",
"Tourism attractiveness", "Register of accommodation establishments",
"Register of accommodation establishments", "Improve tourist accommodation base",
"Improve tourist accommodation base", "Register of accommodation establishments",
"Improve tourist accommodation base", "Register of accommodation establishments",
"Improve tourist accommodation base", "Tourism attractiveness",
"City Tourism", "Tourism potential", "Event related tourism",
"Tourism attractiveness", "City Tourism", "Event related tourism",
"Tourism potential", "Register of accommodation establishments",
"City Tourism", "Event related tourism", "Tourism potential",
"Improve quality data on trips", "City Tourism", "Event related tourism",
"Tourism potential", "Improve quality of tourists expenses data",
"Improve quality data on trips", "City Tourism", "Event related tourism",
"Tourism potential", "Improve tourist accommodation base", "Spatial disaggregation of accommodation",
"Register of accommodation establishments", "Flash estimates",
"Improve quality of tourists expenses data", "Improve quality of satellite accounts",
"nocowanie.pl", "Improve quality of tourists expenses data",
"Improve quality data on trips", "City Tourism", "Geolocation Tool (HERE Maps API)",
"Name, Address, PostalCode", "Name, Address, PostalCode", "Geolocation Tool (HERE Maps API)",
"Coordinates", "hotels", "Survey on tourist accommodation base",
"Geolocation Tool (HERE Maps API)", "Coordinates", "Name, Address, PostalCode",
"Name, Address, PostalCode", "Combining Data", "Combining Data",
"Combining Data", "Improve tourist accommodation base"), length = c(250,
250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250,
250, 250, 250, 5, 250, 250, 250, 5, 250, 5, 250, 250, 5, 250,
250, 250, 5, 250, 250, 5, 5, 5, 5, 250, 5, 5, 5, 250, 250, 250,
250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250,
250, 250, 250, 250, 250, 250, 250, 5, 5, 250, 5, 5, 250, 250,
250, 250, 250, 250, 250, 5, 250, 5, 250, 5, 250, 250, 250, 250,
250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250,
250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250,
250, 250, 250, 250, 250, 5, 5, 5, 5, 250, 5, 5, 5, 5, 250, 5,
250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250,
250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250,
250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250,
250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250,
250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250,
250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 5,
250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250, 250,
250, 250, 250, 250, 250), width = c(1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1), label = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, "string distance", "string distance", "1-1 match", NA), arrows = c("FALSE",
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE",
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "to", "middle;to",
"middle;to", "to", "FALSE", "to", "to", "to", "FALSE", "to",
"FALSE", "to", "to", "FALSE", "to", "to", "to", "FALSE", "to",
"to", "FALSE", "FALSE", "FALSE", "FALSE", "to", "FALSE", "FALSE",
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE",
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE",
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE",
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "to", "FALSE", "FALSE",
"to", "to", "to", "to", "to", "to", "to", "FALSE", "to", "FALSE",
"to", "FALSE", "to", "to", "to", "to", "to", "to", "to", "to",
"to", "to", "to", "to", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE",
"FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE",
"to", "to", "to", "to", "to", "to", "to", "to", "to", "to", "to",
"FALSE", "FALSE", "FALSE", "FALSE", "to", "FALSE", "FALSE", "FALSE",
"FALSE", "to", "FALSE", "to", "to", "to", "to", "to", "to", "to",
"to", "to", "to", "to", "to", "to", "to", "to", "to", "to", "to",
"to", "to", "to", "to", "to", "to", "to", "to", "to", "to", "to",
"to", "to", "to", "to", "to", "to", "to", "to", "to", "to", "to",
"to", "to", "to", "to", "to", "to", "to", "to", "to", "to", "to",
"to", "to", "to", "to", "to", "to", "to", "to", "to", "to", "to",
"to", "to", "to", "to", "to", "to", "to", "to", "to", "to", "to",
"to", "to", "to", "to", "FALSE", "to", "to", "to", "FALSE", "FALSE",
"FALSE", "to", "FALSE", "to", "to", "to", "to", "to", "to", "to",
"to", "to", "to")), row.names = c(NA, 223L), class = "data.frame")
# Create VisNetwork object ------------------------------------------------
# "height" and "width" are optional and the default is automatic.
# "height" can be customized in pixels or as a percentage
# according to your display resolution.
# "main" and "submain" define the title and subtitle.
visNet_Workflow_var_map <- visNetwork(nodes, edges,
# height = "800px",
height = "1000px",
width = "100%",
main="Overview of Variable Mapping on the PT Use Case",
submain="[PT]") %>%
# ~~Create Legend ------------------------------------------------
# Create a user defined VisNetwork Legend
visLegend(useGroups = FALSE,
width = 0.3,
# width = 0.2,
position = "right",
addNodes = data.frame(
label = c("Data Sources Type", "Data Sources", "Experimental Results", "External links", "Tools", "Variables"),
shape = c("dot", "dot", "square", "triangle", "diamond", "dot"),
color = c("teal", "orange", "purple", "lightblue", "lightgrey", "darkred"))) %>%
# ~~Interaction ----------------------------------------------
# Added functionalities for user interaction: navigations buttons and multi selection.
# Green navigation buttons are placed on the bottom left and bottom right of the network canvas.
# These will help to navigate, zoom in and out and re-center.
# Multi-selection: a long click as well as a control-click will add to the selection.
visInteraction(navigationButtons = TRUE, multiselect = T ) %>%
# ~~Export Current Network Canvas to PNG -------------------------------------------
# A small button with "Export to PNG" text will be placed on the bottom right corner
# of the network canvas. On click, a PNG file (the default) will be exported to downloads folder.
# Can be configured to "jpeg" or "pdf"
visExport(type = "png") %>%
# ~~Use igraph Layout -------------------------------------------
# Using an igraph layout allows computing coordinates for fast rendering.
# The network will be rendered faster and with no stabilization.
# This is optional and will override the default visNetwork settings.
# visIgraphLayout(type = "full") %>%
# ~~Default Format for Nodes -----------------------------------------
# Nodes will get a dot shape by default if none is provided in
# nodes$shape.
visNodes(shape = "dot",
color = list(
highlight = "#FF8000"),
shadow = list(enabled = TRUE, size = 10),
labelHighlightBold = T) %>%
# ~~Default Format for Edges -----------------------------------------
visEdges(shadow = FALSE,
color = list(color = "#0085AF", highlight = "#C62F4B"),
# smooth = list(enabled = F), # to generate straight lines
arrows = 'to') %>%
# ~~General Options (combo box)----------------------------------------------------------
# Creates two combo boxes for node selection: one based on the node label
# (nodes$label) and another based on the comma-separated multiple groups
# per node in nodes$group
# visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
visOptions(highlightNearest = list(enabled = T, degree = 2, hover = T),
manipulation = TRUE,
# First combo box "Select by id"
# Defined by nodesIdSelection; "values" is an optional vector of possible
# node ids (order is preserved; defaults to all ids in the nodes data frame).
nodesIdSelection = list(enabled = T, selected = "1"),
# Second combo box "Select by group"
# Defined by selectedBy, uses multiple "categories" separated by a comma in nodes$group
selectedBy = list(variable = "group", multiple = TRUE)) %>%
# ~~Layout Options ----------------------------------------------------------
# When not using hierarchical layout, providing random seed manually
# will produce the same layout every time.
visLayout(randomSeed = 11, improvedLayout = TRUE, hierarchical = FALSE) %>%
# ~~Open URL in New Window--------------------------------------------------
# This fires an event when a node is double-clicked. In this case the
# event opens the url defined in nodes$url in a new browser window.
# visEvents(selectNode =
visEvents(doubleClick =
"function(params) {
var nodeID = params.nodes[0];
var url = this.body.nodes[nodeID].options.url;
window.open(url, '_blank');
}") %>%
# ~~Clustering Options OFF--------------------------------------------------
# By default, clustering is performed based on nodes$group.
# However, in this case that variable holds multiple groups per node,
# so we define explicitly which nodes the clustering should act on.
# This helps to keep the initial network decluttered.
# Clusters can be expanded by double-clicking and reinitialized by
# clicking "Reinitialize clustering" in the lower left corner of
# the canvas.
# visClusteringByConnection(nodes = c(1,2,3)) %>%
# ~~Physics Options--------------------------------------------------
# Configuration of the physics system governing the simulation
# of the nodes and edges.
# barnesHut is the recommended solver for non-hierarchical layouts.
# The remaining parameters were fine-tuned for this particular network.
visPhysics(maxVelocity = 5,
solver = "barnesHut",
barnesHut = list(avoidOverlap = 0.15,
# centralGravity = 1,
gravitationalConstant = -1500,
springConstant = 0.01),
repulsion = list(centralGravity = 1.5))
# Show result on RStudio viewer-----------------------------------------------------------
visNet_Workflow_var_map
# Save visNetwork to HTML File -----------------------------------------------------------
# Save the visNetwork object to a single self-contained HTML file for sharing.
visSave(visNet_Workflow_var_map, file = "visNet_Workflow_var_map.html")
# ================ |-------------| ================
Annex 7 – CBS taxonomy
CBS Nederlands: toerisme @ CBS taxonomie
Broader Terms: BT recreatie en cultuur
More specific terms: NT2 accommodatiegebruik; NT2 bestedingen toeristen; NT2 logiesaccommodaties; NT2 logiesvormen; NT2 reisgezelschap; NT2 toerismebeleid; NT2 toerismerekeningen; NT2 toeristen; NT2 toeristengebieden; NT2 vakanties van Nederlanders; NT2 zakelijk toerisme
Related terms: RT reisbranche
Non preferred terms: UF vreemdelingenverkeer

CBS English: tourism @ CBS taxonomy
Broader Terms: BT recreation and culture
More specific terms: NT business travel; NT holidays by Dutch people; NT overnight accommodation; NT tour group; NT tourism accounts; NT tourism policy; NT tourist areas; NT tourist expenditure; NT tourists; NT types of accommodation; NT use of accommodation
Related terms: RT travel industry
Non preferred terms: UF foreigners traffic

CBS Nederlands: gebruik van accommodaties (accommodatiegebruik @ CBS taxonomie)
Broader Terms: BT toerisme
More specific terms: NT3 overnachtingen; NT3 verblijfsduur

CBS English: use of accommodation (use of accommodation @ CBS taxonomy)
Broader Terms: BT tourism
More specific terms: NT length of stay; NT overnight stays

CBS Nederlands: bestedingen toeristen (bestedingen toeristen @ CBS taxonomie)
Non preferred terms: UF toeristische bestedingen

CBS English: tourist expenditure (tourist expenditure @ CBS taxonomy)
Non preferred terms: UF tourist spending
Annex 8 – Process for data linkage
Step-by-step process:
1. Download and install the "Fuzzy Lookup Add-In For Excel"24.
2. Insert the two datasets in separate Excel sheets (the scraped data and the survey frame of tourist
accommodation establishments). Prepare the scraped data by removing blanks (e.g. from the postal
code) and non-text characters. Write a new VBA function "translit" and use it to transliterate the
hotel-name column to capital Latin characters. Besides the text itself, this function takes two
arguments: inchar and exchar. The inchar argument is a column (or row) in the sheet that contains,
one by one, the characters to be replaced in the text, and the exchar argument is another column
(or row) that contains the characters that will replace them.
The VBA code is the following:
Function translit(keimeno As String, inchar As Variant, _
                  exchar As Variant) As String
    Application.Volatile True
    Dim Varr As Variant
    Dim pl As Integer, gr As Integer, lu As Integer
    Dim gramma As String
    pl = Len(keimeno)
    ReDim Varr(pl - 1)
    For gr = 1 To pl
        gramma = Mid(keimeno, gr, 1)
        For lu = 1 To inchar.Count
            If gramma = inchar(lu) Then gramma = exchar(lu): Exit For
        Next
        Varr(gr - 1) = gramma
    Next
    translit = Join(Varr, "")
End Function
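For readers outside Excel, the character-by-character replacement performed by translit can be sketched in Python as follows. The Greek-to-Latin mapping shown is a hypothetical excerpt for illustration, not the full table used in the project:

```python
# Sketch of the VBA translit logic: each character of the input that
# appears in inchar is replaced by the character at the same position
# in exchar; all other characters are kept unchanged.
def translit(text, inchar, exchar):
    table = dict(zip(inchar, exchar))
    return "".join(table.get(ch, ch) for ch in text)

# Illustrative (partial) Greek-to-Latin mapping -- an assumption,
# not the actual mapping used in the project.
inchar = list("ΑΒΓΔΕ")
exchar = list("ABGDE")
print(translit("ΑΒΓ HOTEL", inchar, exchar))  # ABG HOTEL
```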
3. The final output of the data-linkage methodology is a joint database of establishments, obtained
by linking the scraped data with the Register data.
4. Evaluate the linkage process by calculating the confusion matrix.
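The evaluation in the last step can be sketched in Python as follows; the match/non-match vectors are invented purely for illustration:

```python
# Evaluate linkage quality: compare the linkage decision for each
# candidate record pair against a manually verified truth set and
# tabulate the confusion matrix plus precision and recall.
def confusion_matrix(predicted, actual):
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    return tp, fp, fn, tn

# Hypothetical decisions for five candidate record pairs
predicted = [True, True, False, True, False]   # linkage said "match"
actual    = [True, False, False, True, True]   # manually verified truth
tp, fp, fn, tn = confusion_matrix(predicted, actual)
precision = tp / (tp + fp)   # share of declared matches that are real
recall    = tp / (tp + fn)   # share of real matches that were found
```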
24 The Fuzzy Lookup Add-In for Excel performs fuzzy matching of textual data in Microsoft Excel. It can be used
to identify fuzzy duplicate rows within a single table or to fuzzy join similar rows between two different tables.
The matching is robust to a wide variety of errors including spelling mistakes, abbreviations, synonyms and
added/missing data. For instance, it might detect that the rows “Mr. Andrew Hill”, “Hill, Andrew R.” and “Andy
Hill” all refer to the same underlying entity, returning a similarity score along with each match. While the default
configuration works well for a wide variety of textual data, the matching may also be customized for specific
domains or languages.