MultiProxy: a collaborative approach to censorship circumvention

Gaomei Shi


Master’s Thesis in Computer Science

Distributed Systems group, Faculty of Electrical Engineering, Mathematics, and Computer Science

Delft University of Technology

Gaomei Shi

28th March 2019

Author: Gaomei Shi

Title: MultiProxy: a collaborative approach to censorship circumvention

MSc presentation: 29th March 2019

Graduation Committee:
Dr. sc. ETH J.S. Rellermeyer, Delft University of Technology
Dr. ir. J.A. Pouwelse, Delft University of Technology
Dr.-Ing. T. Fiebig, Delft University of Technology

Abstract

In recent years, many countries and administrative domains have exploited control over their communication infrastructures to censor online materials. The concrete reasons behind Internet censorship remain poorly understood due to the opaque nature of the systems. Generally, Internet censorship aims to disrupt the free flow of information. It involves a series of steps to stop the dissemination of information or to prevent access to it, for example by disrupting the link between users and providers. These technologies cause significant inconvenience for legitimate users. The goal of this thesis is to undertake a recent study that measures the behavior of the Great Firewall of China (GFW). Based on that, this work designs a Peer-to-Peer (P2P) circumvention system called MultiProxy, which exploits a blockchain-based economic model to create a balanced environment for providing and consuming resources. The system also uses multi-hop messaging to protect the privacy of request initiators. The evaluation results show that MultiProxy can evade censorship while protecting users' privacy.


Preface

Censorship has existed and become prevalent since the advent of the Internet. It is like a double-edged sword, with both positive and negative impacts on the general public. For instance, Internet censorship limits the spread of harmful information, while at the same time it restricts access according to the preferences of regimes, which can be inconvenient for netizens. Take the GFW, the world's largest country-wide Internet censorship system, as an example. There is little formal documentation about the operational principles of such a sophisticated system. Therefore, I think its patterns deserve to be explored, and with knowledge of the working mechanisms of the GFW, a countermeasure can be designed.

It was very pleasant to work on this exciting and challenging research topic in the Distributed Systems group. First of all, I would like to thank my supervisor Jan; I would not have been able to complete the underlying work without his excellent scientific guidance. I would also like to thank Martijn for his assistance and support with the project API usage and coding. Finally, I would like to express my sincerest gratitude to my family for their unconditional support, encouragement and motivation.

Gaomei Shi

Delft, The Netherlands

18th March 2019


Contents

Preface

1 Introduction
  1.1 A brief history of the GFW
  1.2 The categories of circumvention systems
  1.3 Thesis Structure

2 Problem description
  2.1 Research Questions
  2.2 Internet Censorship
    2.2.1 Client-side censorship
    2.2.2 Server-side censorship
    2.2.3 In-path censorship
    2.2.4 On-path censorship
  2.3 Analysis and Blocking Mechanisms
    2.3.1 In-path censorship
    2.3.2 On-path censorship
  2.4 Obfuscation of censorship circumvention systems
    2.4.1 Payload encryption
    2.4.2 Randomizer
    2.4.3 Mimicry
    2.4.4 Tunneling
  2.5 P2P architecture - Build a trust network to overcome censorship

3 Empirical evaluation of the GFW
  3.1 Experimental Setup
  3.2 IP blocking
  3.3 TCP connection reset
    3.3.1 HTTP keywords detection
    3.3.2 DNS keywords detection
    3.3.3 TCP connection reset module
  3.4 DNS hijacking and DNS cache poisoning
    3.4.1 How the GFW intercepts DNS resolution
  3.5 Experimental results and summary
    3.5.1 Threat model
    3.5.2 Results and Circumvention suggestions

4 Circumvention System Design
  4.1 Design goals
  4.2 System Architecture
  4.3 Traffic forwarding
    4.3.1 SOCKS on edges
    4.3.2 Peer-to-peer system
  4.4 Token economy
    4.4.1 Analysis of threats to the system
    4.4.2 Solutions for threats
  4.5 Multi-hop messaging
    4.5.1 Solutions for data privacy
  4.6 Implementation details
    4.6.1 Traffic forwarding
    4.6.2 Token economy
    4.6.3 Anonymous messaging

5 Evaluation
  5.1 Evaluation framework
    5.1.1 Network performance
    5.1.2 System performance
  5.2 Methodologies and experimental steps
    5.2.1 System performance
    5.2.2 Performance Comparison
  5.3 Results and analysis
    5.3.1 Network performance
    5.3.2 System performance
    5.3.3 Scalability test

6 Conclusions and Future Work
  6.1 Conclusions
    6.1.1 Results for each research question
    6.1.2 Main contributions
  6.2 Future Work

List of Figures

2.1 An overview of censorship system
2.2 A general censorship model
2.3 DNS hijacking and DNS poisoning

3.1 Experimental pipeline for categorizing different root causes
3.2 IP blocking
3.3 TCP connection reset
3.4 The model of TCP connection reset device
3.5 DNS hijacking
3.6 DNS cache poisoning

4.1 A Tor circuit with pluggable transport meek
4.2 The average connected clients number of cymrubridge02[32]
4.3 Circumvention system architecture
4.4 Traffic routing path
4.5 Work flow of the SOCKS5 protocol
4.6 Bootstrapping of a peer-to-peer system
4.7 Challenge response mechanism
4.8 A malicious server node
4.9 Trustchain protocol[20]
4.10 Intel SGX Remote attestation[19]
4.11 Intel SGX Remote attestation full work flow[19]
4.12 A 2-hop onion routing circuit
4.13 The class UML diagram of MultiProxy
4.14 Components and work flow of MultiProxy
4.15 Packet structure
4.16 Build a 2-hop circuit[37]

5.1 Premium networking topology of Google Cloud instances[21]
5.2 Latency with different hop length
5.3 Latency measurement
5.4 Throughput measurement
5.5 CPU usage (%)
5.6 Memory usage (MB)
5.7 Scalability test

List of Listings

1 Traceroute from uncensored domain
2 Traceroute from censored domain
3 Packet capture example of TCP connection reset over HTTP protocol
4 Packet capture example of TCP connection reset over HTTP protocol during a certain period
5 Packet capture example over TCP
6 Packet capture example over UDP

List of Tables

3.1 Poisoned IP addresses
3.2 The proportion of poisoned domain names
3.3 The number of domains affected by each blocking mechanism

4.1 The IP addresses of meek server
4.2 MultiProxy system components

5.1 Evaluation framework

Chapter 1

Introduction

This chapter gives the background of censorship. Censorship can occur in various types of media for a variety of reasons, such as politics, religion, copy approval, etc. This thesis does not focus on the root causes of censorship. Instead, it emphasizes the technical aspects of Internet censorship. Since different countries implement their censorship systems differently, this work takes the country-wide content monitoring system, the GFW, as a case study. Section 1.1 introduces a brief history of the Great Firewall of China (GFW). Following this, a short introduction to different circumvention methods and systems is presented in section 1.2. Section 1.3 outlines the thesis structure.

1.1 A brief history of the GFW

Censorship has existed for many years under different rules and regulations, as the Internet has become a common communication platform. The most famous example is the GFW, also known as the Great Firewall of China, the most extensive and sophisticated country-wide Internet censoring and monitoring system in the world. It is a combination of hardware and software that aims at distinguishing and blocking network traffic on a particular blacklist. The undesired web content includes search engines, e.g., Google and DuckDuckGo, and social media and social networking websites, e.g., Twitter, Facebook, Instagram, YouTube, etc. The GFW uses multiple techniques and modules to prevent citizens from accessing the blocked content.

In the last few decades, most application layer protocols such as HTTP and DNS have been used directly on top of transport layer protocols, e.g., TCP and UDP. Although these application protocols provide standard ways of transferring resources, they are vulnerable to man-in-the-middle attacks because they give insufficient consideration to security. All network traffic travels in plain text between the source and the destination. This lack of security gives the GFW a chance to inspect the content carried by specific protocols. In its early years, from 2002[41], the GFW started developing a keyword filtering system to block access to selected target websites, including some search engines and social media websites that can spread a massive amount of information. The main techniques include HTTP keyword detection, TCP connection reset, DNS hijacking, and DNS poisoning. Some simple TCP and HTTP proxies do not work because of these techniques. The GFW also uses IP address blocking to prevent netizens from accessing the poisoned websites via their correct IP addresses.

Later, censorship circumvention tools were developed to evade Internet surveillance, e.g., VPNs and proxies such as HTTP and HTTPS proxies. Hidden services and anonymous peer-to-peer communication systems were also developed, e.g., the Tor Project. After a long arms race between the GFW and censorship circumvention systems, the GFW started blocking and filtering specific network traffic, e.g., the SSH protocol and the OpenVPN protocol. Furthermore, once circumvention services are detected, their service providers, such as cloud instances, are IP-blocked by the GFW.

In addition, the GFW is able to discover hidden circumvention services. Ensafi et al.[11] discover that the GFW uses probing mechanisms, such as sending random binary data every 15 minutes, to detect and block Tor's bridges.

In recent years, the GFW has started to adopt HTTPS certificate detection and Server Name Indication (SNI) detection against HTTPS traffic to blocked websites, but these have not been used on a large scale[41]. These mechanisms work because the target server addresses can still be identified inside the messages.

Although the GFW can block harmful websites and restrict the spread of violent and criminal information, some profitable and useful sites are also blocked. There are replacements for some of these blocked websites; for example, China has its own search engine called Baidu, but it is not very accurate at searching foreign news and literature. Therefore, the existence of the GFW builds up a barrier that, to a certain extent, prevents netizens in the censored domain from using legal services.

1.2 The categories of circumvention systems

Some censorship circumvention systems aim at helping people inside the censored domain retrieve legal and useful information from blocked websites. Most systems use a client-server architecture. One example is VPN solutions, which work at the network layer; practical applications include OpenVPN and OpenConnect. Another is proxies that work at the transport layer; popular applications include Shadowsocks and V2Ray. Client-server systems have an intuitive, simple structure and are easy to deploy and use, but the drawback is obvious: all clients rely on the existence of the servers. Once the servers are taken down by the GFW, the only solution is to buy and configure another cloud instance to set up a new server. Another disadvantage is that if the server nodes are malicious, privacy and security during browsing are hard to guarantee. Another kind of system is the anonymous communication system, which makes use of onion routing[33] to build a secure data tunnel between the endpoints, e.g., the Tor Project. Compared to client-server systems, this kind of system avoids the single point of failure: if the GFW blocks one network circuit, the system can switch to other tunnels. In addition, these systems can protect the identity and privacy of the request originators, because the nodes in the circuit act as traffic forwarders and have a limited range of knowledge, i.e., they only know the identities of the previous and next nodes. The exit nodes, as the last hop of the Tor network, decrypt the messages and act as the client to build the connection to the target websites. The location and number of exit nodes are the key point of these systems. In order to achieve high performance, such as low latency and high throughput, the system needs to balance the number of service consumers and service providers. The bridges and exit nodes of Tor are set up by volunteers, and Tor does not have a solution for this problem so far.

1.3 Thesis Structure

The goal of the thesis is to undertake a recent study to measure the behavior of the GFW. A threat model of the GFW is then created from its measured behaviors, in order to design, develop and evaluate a systematic approach to evade the censorship. This thesis introduces MultiProxy, a peer-to-peer proxy system for private web access. Compared to current client-server proxy or VPN systems, the design of MultiProxy exploits the resources of a group of nodes. It has three major functionalities. First, it provides basic circumvention services, in which traffic can be forwarded through different nodes. Second, the system introduces a token economy, which means everyone who participates in the network needs to pay or earn tokens according to their identities. Therefore, MultiProxy can balance the number of consumers and circumvention service providers. Compared to Tor's free use of bridges and exit nodes, MultiProxy can build a robust and healthy environment in which users achieve low latency and high throughput. Third, the system takes advantage of onion routing to protect the privacy of the request originators. MultiProxy selects a group of candidate nodes to build tunnels for transmitting data between the originators and the target websites.

This thesis has four major parts. The research questions are given in Chapter 2; to provide a basic understanding of censorship techniques, this chapter also gives background on the GFW and Censorship resistance systems (CRS), including the different methods and architectures from the literature. Chapter 3 gives a comprehensive measurement of the GFW and looks into how censorship in China works, e.g., the phenomena and operating principles behind the different blocking methods, and presents the threat model of the GFW. Chapter 4 designs and implements a novel system called MultiProxy to circumvent the censorship, based on the gained insights. The evaluation and improvement are in Chapter 5. Finally, the conclusions and future work are described in Chapter 6.


Chapter 2

Problem description

Section 2.1 describes five research questions the thesis aims to solve. The primarycontent blocking techniques and the anti-censorship methods are summarized.

Before the experimental steps, a comprehensive literature survey on censorship and anti-censorship methods is conducted. Nearly 60 papers are selected and classified into different categories, and the results are summarized in sections 2.2 to 2.5. The definition of Internet censorship is given in section 2.2. The analysis and blocking mechanisms are discussed in section 2.3. In section 2.4, the different obfuscation methods adopted by censorship circumvention systems are presented. The peer-to-peer architecture is discussed in section 2.5.

2.1 Research Questions

The goal of the thesis is to provide an efficient and robust way for people inside the censored domain to retrieve useful legal information. To achieve this, the following five research questions need to be addressed:

• What are the current content blocking techniques?

• What are the current anti-censorship methods?

• How can an effective way for evading censorship be developed?

• How can the performance and effectiveness of censorship evasion systems be evaluated?

• What are the lessons and recommendations for circumvention?

The first two questions aim at investigating the current content blocking techniques and anti-censorship techniques. The following two questions focus on designing and evaluating efficient censorship circumvention systems. The recommendations for circumvention are given in the last chapter.

2.2 Internet Censorship

To answer the first question, the definition of censorship needs to be clarified. The communication model of client and server is shown in Figure 2.1. Internet censorship can take place in the TCP/IP stack of both endpoints as well as in the path between them. Thus, censorship can be classified into client-side censorship, server-side censorship, in-path censorship and on-path censorship.

Figure 2.1: An overview of censorship system

2.2.1 Client-side censorship

Client-side censorship means users can only retrieve a limited set of resources due to built-in functionality in the censoring applications, such as network filters. These applications (e.g., TOM-Skype, Sina-UC, LINE, Green Dam) can disrupt normal connections in many ways, for example by showing wrong results when the user triggers sensitive words or URLs in the blacklist, or by disallowing software installation. The general methods to measure client-side censorship include building sensitive keyword lists and reverse engineering the application.


2.2.2 Server-side censorship

In the server-side implementation, the rules run on the remote server, which can selectively remove, hide or block access to specific content according to regulations. Similar to client-side censorship detection, researchers build a list of content as test samples, send these samples to servers and record whether the content is blocked. Zhu et al.[47] find that Weibo, China's Twitter-like service, deletes most sensitive posts within a day by using a retrospective keyword-based mechanism. Similarly, WeChat, the dominant chat application in China, also applies keyword and URL filtering in one-to-one chats as well as group chats. Users receive warning notifications when they trigger sensitive words.¹

2.2.3 In-path censorship

Besides client-side and server-side censorship, censors can also take control of the communication channel, for example by controlling several routers inside the network and injecting false routing information so that those routers discard packets or forward them to the wrong places.

2.2.4 On-path censorship

The censor deploys devices beside the international gateway to monitor or interrupt network traffic. For analysis purposes, these devices, such as a NIDS (Network intrusion detection system), can perform a huge amount of analysis by making copies of the packets within the network communication channel. These systems are able to read inside packets as well as inject additional information into them.

2.3 Analysis and Blocking Mechanisms

Figure 2.2: A general censorship model

¹ https://phys.org/news/2016-12-app-china-censorship.html

This research focuses only on the behavior of adversaries located between the client and the server, more specifically the GFW, China's national-level filtering system. The GFW uses various technical methods to disrupt connections between censored and uncensored regions, including filtering, interference, tampering, and surveillance. There are two main categories of censorship: in-path censorship and on-path censorship. The general censorship model is shown in Figure 2.2. The GFW takes network traffic such as IP packets as input and analyzes it against a set of previously determined rules; the output of this analysis is the input of a cost function and a decision function. Finally, the system decides whether the traffic is blocked or forwarded to the destination. This model is the most general one and can be applied to any censorship system. The GFW is more sophisticated, but it is hard to make assumptions about it because the censorship infrastructure is opaque. One way to address this is to read the literature or publications written by the GFW's designers, or to predefine a set of test samples to determine which methods the GFW uses.
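The general model can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the rule functions, the blocked address, the keyword, the max-based cost function and the threshold are all hypothetical and are not taken from the GFW or from Figure 2.2.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    payload: bytes


# A rule maps a packet to a suspicion score between 0.0 and 1.0.
Rule = Callable[[Packet], float]

# Hypothetical rule data: a documentation-range address and a made-up keyword.
BLOCKED_IPS = {"203.0.113.7"}
KEYWORDS = [b"forbidden"]


def ip_rule(packet: Packet) -> float:
    return 1.0 if packet.dst_ip in BLOCKED_IPS else 0.0


def keyword_rule(packet: Packet) -> float:
    return 1.0 if any(k in packet.payload for k in KEYWORDS) else 0.0


def analyze(packet: Packet, rules: List[Rule]) -> List[float]:
    """Analysis stage: apply every predetermined rule to the packet."""
    return [rule(packet) for rule in rules]


def cost(scores: List[float]) -> float:
    """Cost stage (assumed form): take the strongest single indication."""
    return max(scores, default=0.0)


def decide(packet: Packet, rules: List[Rule], threshold: float = 0.5) -> str:
    """Decision stage: block when the cost reaches the threshold."""
    return "block" if cost(analyze(packet, rules)) >= threshold else "forward"


RULES = [ip_rule, keyword_rule]
print(decide(Packet("10.0.0.1", "198.51.100.5", b"GET /forbidden"), RULES))
print(decide(Packet("10.0.0.1", "198.51.100.5", b"GET /index.html"), RULES))
```

Separating analysis, cost and decision keeps the sketch close to the three stages of the model; a real censor would likely weigh collateral damage and processing cost in the cost stage, which this sketch does not attempt.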

2.3.1 In-path censorship

Analysis approach

In-path censorship relies on stream-level analysis[15], which is based on a 3-tuple of source IP address, destination IP address and protocol. It is effective and much simpler than flow-level analysis, since it only looks into IP headers and checks whether the destination IP address is legitimate.

Decision approach

IP blocking is one of the earliest methods deployed by the GFW. It takes place in the network layer. A trivial solution is to maintain an Access Control List (ACL) of blocked websites on gateway routers. When packets pass through the router, the router first compares the destination IP addresses with the ACL and then either forwards the packets or silently discards them. Although this solution is simple and straightforward, it is not effective for a backbone network that needs to handle a huge amount of traffic, especially when the ACL contains a large number of IP addresses, since matching the destination IP address takes additional time. In addition, since routers have relatively small memory, they are not suitable for storing a huge number of IP addresses. Liu et al.[27] propose a lightweight control method based on routing diffusion for large-scale networks, also called Border Gateway Protocol (BGP) hijacking. In this method, the GFW injects wrong static routing information, such as a manually configured routing entry, into a gateway router. This router then shares the wrong routing information with all its peer gateway routers by using BGP and Open Shortest Path First (OSPF) redistribution. Consequently, the wrong information is propagated to all routers inside the censored domain, which means the GFW hijacks all traffic trying to access blocked websites in the Autonomous system (AS). This traffic is either lost in transit or redirected to traffic analyzers.

IP blocking has two limitations. First, the effect of IP address blocking relies on the accuracy of the routing information, which needs to be carefully maintained and updated. Second, users can easily circumvent the censorship by using proxies outside the censored domain. IP blacklisting can be combined with port blocking, e.g., selectively closing services on a server by checking and filtering out packets with specific ports, such as port 22 for Secure Shell (SSH), 80 for Hypertext Transfer Protocol (HTTP), and 443 for Hypertext Transfer Protocol Secure (HTTPS).
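The ACL lookup and its combination with port blocking can be sketched as follows. This is a minimal illustration, not a description of real router firmware: the ACL entries use documentation-range prefixes, and the `ACL` layout and the `router_action` helper are hypothetical names introduced here.

```python
import ipaddress
from typing import List, Optional, Set, Tuple

# Each hypothetical ACL entry is (blocked prefix, blocked ports).
# A port set of None means every port on that prefix is blocked.
AclEntry = Tuple[ipaddress.IPv4Network, Optional[Set[int]]]

ACL: List[AclEntry] = [
    (ipaddress.ip_network("203.0.113.0/24"), None),     # whole prefix blocked
    (ipaddress.ip_network("198.51.100.17/32"), {22}),   # only SSH blocked on this host
]


def router_action(dst_ip: str, dst_port: int) -> str:
    """Return 'drop' if the destination matches an ACL entry, else 'forward'."""
    addr = ipaddress.ip_address(dst_ip)
    # Linear scan: the per-packet cost grows with the ACL size, which is the
    # scalability problem described for backbone routers.
    for network, ports in ACL:
        if addr in network and (ports is None or dst_port in ports):
            return "drop"
    return "forward"


print(router_action("203.0.113.9", 443))
print(router_action("198.51.100.17", 22))
print(router_action("198.51.100.17", 443))
```

The sketch silently discards matching packets, mirroring the text; the linear scan is deliberate, since it makes visible why a long ACL adds matching time to every packet.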

2.3.2 On-path censorship

IP blocking can be easily defeated by using proxies, and maintaining an accurate ACL is costly. Furthermore, the GFW is not capable of discovering all IP addresses or putting them into its blacklist. Due to these drawbacks, the designers started to look for alternatives to recognize network flows that try to bypass the censorship. A good solution is to put traffic analyzers into the network and block sensitive flows according to the content in the transport layer and application layer. An in-path system is not suitable for performing heavy analysis, since it would lead to traffic delay and congestion. Instead, the GFW started to use an on-path system, which can handle extremely high throughput.

The GFW copies all IP packets from the network to its side-channel traffic analyzers, then reassembles those packets according to the sequence numbers inside the TCP headers to perform deep packet inspection (DPI) in the application layer. The GFW has its own implementation of the TCP/IP stack to resynchronize the TCP segments for further analysis. By now it can distinguish many popular protocols such as HTTP and Domain Name System (DNS). It has several flow-level traffic analysis[3] mechanisms, which are based on a 5-tuple of source IP address, source port, destination IP address, destination port, and protocol.
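A minimal sketch of this reassembly step is given below. It groups segments by the 5-tuple, orders them by sequence number and searches the rebuilt byte stream for a keyword. The `FlowKey`, `Segment`, `reassemble` and `flows_with_keyword` names are hypothetical, and the sketch ignores gaps, overlapping segments and sequence number wrap-around, all of which a real TCP stack has to handle.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class FlowKey(NamedTuple):
    """The 5-tuple that identifies a flow."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


class Segment(NamedTuple):
    flow: FlowKey
    seq: int      # TCP sequence number of the first payload byte
    data: bytes


def reassemble(segments: List[Segment]) -> Dict[FlowKey, bytes]:
    """Rebuild one byte stream per flow, ordered by sequence number."""
    by_flow: Dict[FlowKey, Dict[int, bytes]] = defaultdict(dict)
    for seg in segments:
        # Keep the first copy seen for a sequence number (ignores retransmits).
        by_flow[seg.flow].setdefault(seg.seq, seg.data)
    return {
        flow: b"".join(parts[seq] for seq in sorted(parts))
        for flow, parts in by_flow.items()
    }


def flows_with_keyword(segments: List[Segment], keyword: bytes) -> List[FlowKey]:
    """Return the flows whose reassembled stream contains the keyword."""
    return [f for f, stream in reassemble(segments).items() if keyword in stream]


flow = FlowKey("10.0.0.1", 40000, "198.51.100.5", 80, "tcp")
# The keyword is split across two segments that arrive out of order.
captured = [
    Segment(flow, 1008, b"bidden HTTP/1.1\r\n"),
    Segment(flow, 1000, b"GET /for"),
]
print(flows_with_keyword(captured, b"forbidden"))
```

The example shows why per-packet matching is not enough: neither segment contains the keyword on its own, so the analyzer only detects it after reassembly.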

Analysis approach

• Port-based classification

Port-based traffic analysis checks the port number inside the TCP header (e.g., port 80 means HTTP traffic). This method is well suited to classifying massive network traffic by simply mapping services to well-known port numbers, but the accuracy can be very low. To protect services or applications from attack, system administrators open their services on a different port number instead of the default well-known one, and some applications use dynamic port allocation. Moore et al.[30] show that with classification by well-known port numbers, a large amount of network traffic remains unknown while a small amount is misclassified.

• Deep packet inspection

Deep packet inspection is a way of inspecting network traffic flows passing through specific checkpoints to make real-time decisions. With this technology, the GFW can detect sensitive keywords inside packets or suspicious network traffic flows. Unlike traditional network analysis, which only checks the structured information in packet headers such as IP, TCP, and UDP headers, DPI looks into the content of packets, namely the application layer. A common way to acquire network packets for DPI is to use port mirroring, also known as a span port, and an optical splitter. DPI can distinguish many application layer protocols, such as Peer-to-Peer (P2P), VoIP, and audio/video streaming, and it can be considered an excellent way to identify the real network traffic inside packet encapsulation. The main approaches of DPI include payload-based classification and pattern-based classification.

• Payload-based classification

Traffic patterns can be identified from the payload inside transport layer packets. This method can obtain high accuracy, but its computational complexity is high. Moore et al.[30] show that combining port-based classification and payload-based classification can increase the correctly identified traffic from approximately 70% to almost 79%. Other disadvantages include invasion of privacy and having no effect on obfuscated or encrypted network traffic.

• Pattern-based classification

This method classifies network traffic flows based on host behavior in the transport layer. Compared to port-based and payload-based classification, it only uses connection patterns rather than ports and payloads. Karagiannis et al.[22] develop a systematic methodology to identify application layer P2P flows by using peer connection patterns without relying on the packet payload. Based on this research, they later came up with a new system called “BLINC”[23] to analyze traffic patterns at the social, functional and application levels. The evaluation shows that it can classify 80%-90% of the traffic with more than 95% accuracy. The advantage of this approach is that it can identify network traffic when the payload is encrypted. Furthermore, it reduces the computational complexity compared to payload-based classification. However, this approach is still in the experimental stage due to the lack of large datasets, e.g., it has difficulty identifying a new or self-invented protocol.

• Machine learning classification

The methods mentioned before are simple filter rules applied to filter suspicious network traffic flows, whereas machine learning is designed for classifying a sizeable amount of traffic in real time. It combines multiple features and applies learning algorithms, including supervised, unsupervised, and semi-supervised learning, to achieve high accuracy. Network analyzers usually collect a large number of packets, use the behavior of network traffic flows and a set of parameters inside the packets as features or discriminators, and then train a network traffic classifier that can distinguish one type of traffic from another. Yuan et al.[46] propose an accurate network traffic classification based on the SVM method, and the experimental results show an accuracy of more than 97.17%. The advantage of machine learning based network traffic classification is that it works for both unencrypted and encrypted data, since it does not rely on the payload of packets. Machine learning classification can achieve high accuracy when suitable features are chosen for a large dataset, but the computational cost of training an accurate model is considerable, and it does not immediately work for new protocols or when some bits inside the packets of an existing protocol are changed.

• Active probing

In some special cases, the GFW needs to confirm that a server is running a prohibited protocol. In this situation, the GFW establishes a connection with the suspicious proxy server as a regular client would. If the server responds correctly and the connection is established successfully, the GFW blocks the server. The research conducted by Ensafi et al.[11] shows that the GFW uses active probing to discover Tor bridges.

Decision approach

After sensitive flows are detected in the transport layer and application layer, the GFW applies multiple blocking mechanisms, typically TCP connection reset and DNS poisoning:

• TCP connection reset

TCP connection reset is also known as a TCP reset attack. It mainly works on the application layer and the transport layer. The GFW captures all IP packets that try to pass through the international gateway and reorders them according to their sequence numbers. Once sensitive content is detected in the application layer, e.g., sensitive keywords appear in HTTP GET requests, the GFW injects RST (type-1) and RST/ACK (type-2) packets to both client and server to force the ongoing connection to shut down. Wang et al.[39] evaluate the recent behavior of the GFW; after several tests, they found that the GFW creates a TCB based on both SYN and SYN/ACK packets and that it enters a resynchronization state in some cases. Xu et al.[45] find that filtering occurs in border gateways and many provincial networks. Performing TCP connection reset is costly because the connection state must be recorded in a Transmission Control Block (TCB), a data structure that keeps track of information about each connection, e.g., local and remote port numbers, sender and receiver buffers, protocols, and current segments.

• DNS hijacking and poisoning

The Domain Name System (DNS) is a hierarchical, decentralized naming system that maintains a directory of domain names and translates them to IP addresses. DNS spoofing and hijacking work in the application layer. These techniques are used in conjunction with IP blocking: since one domain name can map to several server IP addresses, the GFW can block several addresses at the same time, but users who enter the correct IP address of a domain name can still reach the destination. The DNS server is normally provided by the Internet Service Provider (ISP). It caches these translations for a period and can therefore answer DNS queries immediately until the cache expires. When a DNS server has received a fake translation into its cache, it returns an incorrect IP address and directs the traffic to another server. The GFW usually injects forged DNS answers into local DNS servers and also performs intrusion detection on port 53. Once the GFW detects a sensitive query, it immediately injects a DNS reply with a fake IP address. Since the fake answer arrives much earlier than the legitimate one, the DNS server ignores the later one and forwards only the false answer to users.

Figure 2.3: DNS hijacking and DNS poisoning
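The race between the injected reply and the legitimate reply can be illustrated with a minimal sketch. It assumes a naive resolver that accepts whichever answer arrives first; the `DnsReply` type, the arrival times, and the addresses (taken from documentation ranges) are hypothetical and are not measurements from this work.

```python
from dataclasses import dataclass


@dataclass
class DnsReply:
    source: str        # label of the sender, for illustration only
    address: str       # IP address carried in the answer
    arrival_ms: float  # arrival time at the resolver in milliseconds


def first_answer_wins(replies):
    """Return the reply a naive resolver would accept: the earliest one."""
    if not replies:
        raise ValueError("no replies received")
    return min(replies, key=lambda r: r.arrival_ms)


# Hypothetical race: the injected answer arrives long before the real one.
replies = [
    DnsReply("authoritative server", "203.0.113.10", arrival_ms=180.0),
    DnsReply("injected by censor", "198.51.100.66", arrival_ms=6.0),
]
accepted = first_answer_wins(replies)
print(accepted.source, accepted.address)
```

In this model the later, legitimate answer is simply discarded, which is why the forged address ends up in the cache of the resolver.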

2.4 Obfuscation of censorship circumvention systems

The earliest approach adopted by censors was simply to block the Internet Protocol (IP) address and port number of the destination. To bypass this kind of censorship, one of the most popular countermeasures is to send requests to a proxy node and let this node send the blocked information back.

The practical success of proxy-based censorship resistance systems has led censors to deploy advanced deep packet inspection (DPI) mechanisms, which can identify traffic based on application-layer information as well as on the network flow behavior between two endpoints. Based on DPI technology, URL-based keyword filtering can be applied at the routers that packets pass through. Consequently, the censors can analyze and tamper with the network traffic. As DPI is widely used in Internet censorship, most modern censorship resistance systems use encryption and obfuscation techniques to evade it. These approaches can be classified into the four major categories described below: encryption, randomizer, mimicry, and tunneling.

2.4.1 Payload encryption

Conventional encryption can be a relatively good obfuscator for hiding network packets from the censors. Encryption, in this case, means semantic-based encryption, that is, only the payload of the transport layer is encrypted. For example, the Transmission Control Protocol (TCP) or User Datagram Protocol (UDP) header is not encrypted, while its payload is. Although encryption prevents a third party from reading the data, encrypting only the payload does not guarantee a reliable connection for the whole communication: most encryption protocols indicate the protocol used for data transmission in a plain-text header, which gives the censor a chance to detect exploitable fingerprints in network packets easily. The negotiation phase of such encryption protocols also shows distinct patterns that are obvious to the censor; it usually transfers specific messages, such as encryption methods or the server’s certificate, which the censor can easily use as fingerprints or signatures. A well-known example is that the GFW can detect sensitive HTTPS network flows and perform a TCP reset attack.

2.4.2 Randomizer

Unlike encryption, an alternative approach is to randomize the payload by applying a stream cipher to every byte. This makes fingerprint identification difficult because there are no characteristic patterns to observe.

Winter et al.[43] propose ScrambleSuit, a thin protocol layer above the transport layer. The protocol obfuscates application data, and its pseudo-random payload makes the entire network traffic hard to distinguish. The same idea is used by the Tor pluggable transport Obfsproxy, including obfs2, obfs3, and obfs4. obfs2 is an obfuscation layer on top of TCP. Its design goal is to prevent a third party from recognizing the specific communication protocol. However, it does not provide authentication or data integrity. The protocol has two phases: (1) establishing keys and (2) exchanging superenciphered traffic. Its successor, obfs3, offers protection against passive deep packet inspection; it cannot be detected without launching a probing attack against its handshake. Tor then proposed the obfs4 protocol, which combines the ScrambleSuit protocol, the Elligator 2 technique, and the ntor protocol. Dust[42], proposed by Wiley et al., is designed to resist deep packet inspection that examines fingerprints as well as packet payloads. The protocol aims to be unobservable and indistinguishable to censors by constructing random packet payloads. Although no visible fingerprint can be detected in randomized traffic, “no fingerprint” itself becomes a feature: censors can exploit the distinction between the plain-text headers of conventional protocols and the output of randomizing obfuscators. As a result, this kind of application can be easily detected by simple heuristics, e.g., length checks and entropy-based tests.
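The entropy-based test mentioned above can be sketched as follows. This is only an illustration of the heuristic, not the detector of any real censor: the threshold of 7.0 bits per byte and the minimum length of 512 bytes are assumed values chosen for the example.

```python
import math
import os
from collections import Counter


def shannon_entropy(payload: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not payload:
        return 0.0
    total = len(payload)
    counts = Counter(payload)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def looks_random(payload: bytes, threshold: float = 7.0, min_len: int = 512) -> bool:
    """Flag payloads that are long enough and close to uniformly random.

    threshold and min_len are assumptions made for this sketch.
    """
    return len(payload) >= min_len and shannon_entropy(payload) >= threshold


# A plain-text HTTP request repeated to reach the minimum length.
plain = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n" * 20
# Random bytes stand in for the output of a randomizing obfuscator.
scrambled = os.urandom(4096)

print(round(shannon_entropy(plain), 2), looks_random(plain))
print(round(shannon_entropy(scrambled), 2), looks_random(scrambled))
```

A real classifier would have to combine such a test with other features, since ordinary encrypted or compressed traffic also has high entropy.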

2.4.3 Mimicry

Mimicry-based censorship resistance systems try to make packet payloads look like those of allowed protocols.

Moghaddam et al.[29] propose SkypeMorph, a Tor pluggable transport intended to camouflage network traffic as a Skype video call between two endpoints. Later, Weinberg et al.[40] design StegoTorus, a Tor pluggable transport forked from Obfsproxy. It splits Tor streams across multiple connections and embeds them in traffic flows that look like HTML, JavaScript, or PDF. Wang et al.[38] propose CensorSpoofer, a framework for censorship-resistant web browsing that hides upstream request contents such as URLs inside instant messages and emails, and downloads the web content from target servers by performing IP spoofing. Burnett et al.[6] design Collage, which allows users to embed request messages inside user-generated content, for example on photo-sharing sites. Although mimicry applications try to make packet payloads look like standard protocols, they do not use a real implementation of the standard protocol. The obfuscator therefore differs from the actual protocol it tries to mimic and can be easily detected by the adversary. For example, researchers find that SkypeMorph and StegoTorus fail even against the weakest censor[16]. Houmansadr et al.[16] show that unobservability by imitation is a fundamentally flawed approach and that partial imitation is worse than no imitation at all. They suggest an alternative: running the original protocol instead of imitating it.

2.4.4 Tunneling

Tunneling obfuscators encapsulate the hidden data higher in the protocol stack, which means the application runs the actual implementation of a protocol to transmit the secret data.

Houmansadr et al.[17] propose FreeWave, which modulates a client’s network traffic into acoustic signals and sends them over a VoIP connection to a FreeWave server; the server then extracts the hidden traffic and proxies it to the blocked Internet destinations. In a paper published in 2017, Houmansadr et al.[18] propose SWEET (Serving the Web by Exploiting Email Tunnels), which hides Internet traffic inside email messages sent through services such as Yahoo Mail and Gmail. Barradas et al.[9] build a censorship-resistant system named DeltaShaper, which carries standard protocols such as FTP, SMTP, or HTTP over Skype video streams. Lee et al.[25] design, implement, and evaluate CovertCast, a censorship circumvention system that encapsulates blocked traffic in live-streaming services.

Decoy routing

Decoy routing is an example of tunneling. In this approach, a client encapsulates its hidden requests in an ordinary communication channel to a non-blocked destination. Unlike in other proxy-based circumvention techniques, the client in decoy routing does not connect directly to a blocked IP address. Instead, a friendly “decoy” router acts as a reflector: it intercepts this traffic, extracts the hidden data from the packets, and forwards it to the target destination. Traffic shaping is an essential part of decoy routing systems, as it protects against traffic analysis by censors.

The decoy routing idea is also presented by Karlin et al.[24]. They show that the client does not need to connect explicitly to a separate server IP address, which censors can easily block. An alternative is to proxy the network traffic through intermediate circumvention devices that are hard for censors to block. One real-world implementation is Telex, proposed by Wustrow et al.[44]. It converts unblocked websites into proxies with the help of Telex stations deployed at friendly ISPs between the censors’ networks and uncensored Internet domains. Telex stations monitor seemingly innocuous network traffic flows and transparently divert them to the forbidden websites or services instead. Another real-world example is TapDance, implemented by Frolov et al.[14], which was deployed on four ISP uplinks with 100 Gbps of bandwidth and served more than 50,000 real-world users over one week. The results show that TapDance can be operated at ISP scale with good performance and at reasonable cost. Bocovich et al.[5] suggest secure asymmetric solutions for previously symmetric decoy routing systems. They later[4] propose Slitheen, a secure decoy routing system capable of perfectly imitating the traffic patterns of a regular client and web server.

Domain fronting

Domain fronting is a special form of decoy routing, but unlike decoy routing, it works at the application layer by using HTTPS. It is easy to deploy and does not need further cooperation from ISPs. The key idea is to use different destination addresses at different network layers. An allowed address, the front domain, appears outside the HTTPS request: in the DNS query and in the Server Name Indication (SNI) field sent when the Transport Layer Security (TLS) connection is established. The Host field inside the HTTP header carries the real address the client wants to visit, and this part is invisible to the censor. Since the censor cannot tell the front address apart from the real address behind the HTTPS encryption, it must let the traffic pass unless it blocks the entire front domain, which may cause great collateral damage. Domain fronting is now implemented in various practical software, e.g., Meek, the domain fronting pluggable transport for Tor. Fifield et al.[12] deploy various censorship circumvention applications, including Tor, Lantern, and Psiphon, for several months, and their experience demonstrates that these systems can serve thousands of users and transfer a huge amount of data per month. The results show that domain fronting has become a circumvention workhorse.
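The split between the visible and the hidden destination can be made concrete with a small sketch. It only constructs the values that would appear at each layer and sends nothing; the type `FrontedRequest`, the helper names, and the domains `allowed.example` and `hidden.example` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrontedRequest:
    dns_name: str   # visible to the censor: name looked up via DNS
    tls_sni: str    # visible to the censor: SNI in the TLS handshake
    http_host: str  # hidden: Host header inside the encrypted HTTP request


def build_fronted_request(front_domain: str, hidden_domain: str, path: str = "/"):
    """Return the per-layer names and the HTTP request that TLS would carry."""
    request = FrontedRequest(front_domain, front_domain, hidden_domain)
    http_bytes = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {hidden_domain}\r\n"
        "Connection: close\r\n\r\n"
    ).encode("ascii")
    return request, http_bytes


def censor_view(request: FrontedRequest) -> dict:
    """Fields an on-path observer can read without breaking TLS."""
    return {"dns": request.dns_name, "sni": request.tls_sni}


req, raw = build_fronted_request("allowed.example", "hidden.example")
print(censor_view(req))
print(raw.decode("ascii").splitlines()[1])
```

The observer sees only the front domain in both visible fields, while the Host header that selects the real destination travels inside the encrypted stream.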

Challenges

There are two main challenges in tunneling. The first is the semantic mismatch between the real protocol and the tunneled data, which may cause packets to be dropped while the hidden data is transmitted, e.g., when packets that are too large are tunneled through the network. The second is that censors can run algorithms such as entropy tests to distinguish tunneled data from regular data. Researchers in China find differing patterns between normal HTTPS traffic and meek traffic by measuring packet intervals and the packet size distribution. Their results show that this anonymous communication system is not unobservable.

2.5 P2P architecture - Build a trust network to overcome censorship

The primary goal of Tor is to ensure anonymity. It uses peer-to-peer proxying to reroute network traffic, and it releases bridges both as default bridges built into the Tor Browser and through a bridge list file (a configuration file) that contains all the bridges. One disadvantage is that censors can also join this network and block all those IP addresses. Fifield et al.[13] examine the average time that censors take to block Tor bridges. They found that bridge reachability varied and that the blocking techniques advanced during the measurement.

Given the failure of Tor bridges, a fair proportion of systems aim to leverage a trust network to resist the censors, that is, to defend against Sybil attacks, which means that the architecture should limit the number of malicious nodes. Practical applications include uProxy and Lantern.

A variety of research works propose peer-to-peer architectures for censorship circumvention systems. Sovran et al.[35] design, implement, and evaluate Kaleidoscope, which proposes a limited discovery protocol to prevent censors from joining the network. The same idea is used in Salmon, presented by Douglas et al.[10], which defines an algorithm that can quickly identify malicious users in the network. Lee et al.[25] design OverTorrent, which allows users to share Internet connections with one another using the BitTorrent protocol. Shen et al.[34] propose Freeweb, which relies on a network of widely distributed peer-to-peer (P2P) nodes and, on that basis, improves the file access delay and reduces the node overload problem.

Previous works elaborate on network structures and node cooperation algorithms and assume that there are enough nodes to construct a P2P network. In the real world, nodes that are accessible and can provide circumvention services are crucial for a circumvention system: requesters, or client nodes, cannot access blocked content unless these nodes forward it. To increase the number of service providers through incentives, this work also focuses on how to build a balanced P2P network by utilizing a blockchain-based economic model.


Chapter 3

Empirical evaluation of the GFW

Chapter 2 summarized results from the literature, but the actual behavior of the GFW still needs to be measured in order to understand the technical principles behind it. This chapter describes the current state of the GFW. Section 3.1 introduces the experimental setup of the GFW measurement. Section 3.2 describes how IP blocking works. The GFW uses TCP connection reset to disrupt normal network traffic to blocked websites; the detection methods that trigger it, DNS and HTTP keyword detection, are explained in section 3.3. Section 3.4 shows how the GFW intercepts DNS resolution over UDP by using DNS hijacking and DNS poisoning. By measuring the different root causes that make websites inaccessible, the number of websites blocked by each mechanism is counted. Finally, the measurement results and the threat model of the GFW are given in section 3.5. The contributions of this chapter include:

• A taxonomy of the behaviors of the GFW and a threat model built on it.

• A categorization of the root causes that make websites unavailable as a guidefor circumventing the censorship.

3.1 Experimental Setup

An open-source GFW blacklist hosted on GitHub¹ is chosen for designing a reasonable measurement experiment and constructing a threat model of the GFW. This list contains almost all the blocked websites reported by netizens, and it has been frequently updated and carefully maintained by a group of volunteers since Feb 22, 2009. Due to the opaqueness of the GFW, the file cannot record all blocked domain names, but it covers most of them. In general, volunteers access websites manually or use scripts to check availability before creating issues in the repository. The issues are carefully tested by administrators to ensure correctness. The domain names on the list can be blocked by the GFW using various techniques, e.g., IP blocking or intrusion detection techniques including TCP connection reset and DNS hijacking/DNS cache poisoning.

¹ https://github.com/gfwlist/gfwlist/blob/master/gfwlist.txt

Figure 3.1 shows the measurement pipeline for processing the GFW list. Although the latest version, updated on Jun 30, 2018, is used, it may contain some outdated information. For example, some servers may have changed their address records on the DNS server, or the GFW may no longer block some of the domains. In the first step, the list entries are filtered by URL, IP address, and comment, which yields 6170 unique Fully Qualified Domain Names (FQDNs) in total. Invalid domain names are removed in the next step. Finally, the remaining 4265 valid domains are used for the categorization.
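The first filtering step can be sketched as follows. The sketch is not the script used in this work: it assumes that the list has already been decoded into Adblock-Plus-style rule lines, that comment and header lines start with "!" or "[", and that URLs are reduced to their host names while bare IP addresses are dropped. The function names and the sample rules are hypothetical.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Syntactic check for a fully qualified domain name (lowercase labels plus TLD).
_FQDN_RE = re.compile(r"^(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$")


def extract_fqdn(rule: str):
    """Return the FQDN named by one rule line, or None if the line is skipped."""
    rule = rule.strip()
    # Skip empty lines, comments, headers, whitelist rules, and regex rules.
    if not rule or rule.startswith(("!", "[", "@@", "/")):
        return None
    rule = rule.lstrip("|.")            # assumed rule prefixes such as "||" or "."
    if "://" not in rule:
        rule = "http://" + rule         # let urlsplit locate the host part
    try:
        host = urlsplit(rule).hostname
    except ValueError:
        return None
    if not host:
        return None
    try:
        ipaddress.ip_address(host)      # bare IP addresses are filtered out
        return None
    except ValueError:
        pass
    return host if _FQDN_RE.match(host) else None


def unique_fqdns(rules):
    """Sorted list of the distinct FQDNs found in the rule lines."""
    return sorted({d for d in map(extract_fqdn, rules) if d is not None})


sample = [
    "[AutoProxy 0.2.9]", "! comment line", "||example.com",
    "|http://www.example.org/path", ".example.net", "203.0.113.7",
    "@@||allowed.example", "||example.com/duplicate",
]
print(unique_fqdns(sample))
```

Checking whether a syntactically valid name still resolves, which corresponds to the removal of invalid domains in the next step, would need DNS lookups and is left out here.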

The next step is to determine the behavior of the GFW and obtain the distribution of websites affected by IP blocking, TCP connection reset, and DNS hijacking/cache poisoning.

To estimate the proportion of poisoned domain names that are also affected by IP blocking and TCP connection reset, a list of correct IP addresses is obtained from outside the censored domain, and the addresses are then categorized with the command-line tool curl. Some servers use name-based virtual hosting, which means that multiple domain names share the same IP address. Therefore, the Host field is set explicitly in the HTTP header to ensure that the client connects to the correct website.

To distinguish all these cases, i.e., to obtain the proportion of websites affected by each technique separately, several command-line tools are used to construct the requests and perform traffic analysis: curl, dig, traceroute, and tcpdump. curl is used to construct HTTP/HTTPS requests over TCP in the IP blocking and TCP reset measurements. traceroute is used to send Internet Control Message Protocol (ICMP) packets that trace the path between client and server. dig is used in the DNS experiment to obtain the address resolutions of domain names. tcpdump is used for client-side traffic analysis to find patterns in the operating mechanisms of the GFW.

3.2 IP blocking

IP blocking is one of the earliest methods adopted by the GFW due to its simplicity and directness. It works on the network layer. When a packet passes through the Internet backbone and is about to leave for the global Internet, the in-path blocking system matches the destination IP address in the IP header against its blacklist. If it matches, the packet is dropped or filtered. There is no way to evade the IP blocking mechanism unless the destination address field is filled with an unblocked IP address.
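The matching step described above can be sketched with the standard `ipaddress` module. The blacklist entries are hypothetical addresses from documentation ranges, and the linear scan is chosen for clarity; a real in-path device would need a far more efficient lookup structure.

```python
import ipaddress


def build_blacklist(entries):
    """Parse single addresses and CIDR ranges into network objects."""
    return [ipaddress.ip_network(entry, strict=False) for entry in entries]


def is_blocked(destination: str, blacklist) -> bool:
    """True if the destination address falls into any blacklisted network."""
    address = ipaddress.ip_address(destination)
    return any(address in network for network in blacklist)


# Hypothetical blacklist: one address range and one single address.
blacklist = build_blacklist(["203.0.113.0/24", "198.51.100.7"])

for dst in ("203.0.113.45", "198.51.100.7", "198.51.100.8"):
    print(dst, "dropped" if is_blocked(dst, blacklist) else "forwarded")
```

Because only the destination address is inspected, the payload and the transport-layer ports play no role in this decision.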


[Figure 3.1: flowchart. Recoverable labels: GFWList; valid domains / invalid domains; dig with TCP; poisoned domains / correctly resolved domains; get real IP addresses; check curl return code (7, 28, 56); server-side failure / IP blocked / TCP connection reset]

Figure 3.1: Experimental pipeline for categorizing different root causes

During the traffic analysis, the application keeps sending the same packets because there is no reply from the other side. For instance, the host sends SYN packets repeatedly since it receives no response on the TCP connection. Figure 3.2 shows the IP blocking mechanism. The reason is that the IP packets are lost on the way: they are either dropped or filtered by in-path devices and never reach the server, so the server never replies to the client. Listing 1 shows the traceroute path from the uncensored domain to one IP address of www.google.com. Listing 2 shows the traceroute result for the same address queried from the censored domain. In the first listing, packets successfully reach the destination, which means that the server side does not block ICMP packets, whereas in the second listing, packets stop being forwarded after they reach the router 119.147.219.241, which belongs to China Telecom (AS4816), one of the Chinese backbone networks.

[Figure 3.2: diagram. Recoverable labels: Client; GFW; IP packets; Filtered; Censored Domain]

Figure 3.2: IP blocking

$ traceroute -n 216.58.211.100
traceroute to 216.58.211.100 (216.58.211.100), 64 hops max, 52 byte packets
 1  145.94.160.2  2.483 ms  1.272 ms  7.041 ms
 2  10.200.23.58  1.112 ms  1.121 ms  1.156 ms
 3  10.200.246.121  1.269 ms  1.160 ms  1.147 ms
 4  10.200.246.5  1.439 ms  1.153 ms  1.181 ms
 5  10.200.24.5  1.385 ms  1.283 ms  1.287 ms
 6  145.145.26.97  2.792 ms  4.937 ms  4.145 ms
 7  145.145.166.86  3.026 ms  2.910 ms  2.788 ms
 8  108.170.241.161  4.056 ms
    108.170.241.129  2.997 ms
    108.170.241.161  4.022 ms
 9  108.170.237.45  3.070 ms  3.379 ms  2.808 ms
10  216.58.211.100  2.804 ms  3.064 ms  2.965 ms

Listing 1: Traceroute from uncensored domain

$ traceroute -n 216.58.211.100
traceroute to 216.58.211.100 (216.58.211.100), 30 hops max, 60 byte packets
 1  * * *
 2  11.212.252.65  5.862 ms  6.276 ms  6.484 ms
 3  11.218.131.97  5.384 ms  11.218.131.173  4.878 ms  11.218.131.149  5.013 ms
 4  11.218.131.246  1.081 ms  1.116 ms  11.218.131.234  1.078 ms
 5  119.38.212.110  2.434 ms  119.38.212.102  1.416 ms  119.38.212.114  2.260 ms
 6  116.251.113.137  1.381 ms  42.120.242.217  2.470 ms  116.251.113.141  1.451 ms
 7  183.2.180.229  1.898 ms  183.2.180.93  1.909 ms  183.2.180.217  1.574 ms
 8  183.2.182.117  2.552 ms  183.2.182.125  2.157 ms  183.2.182.129  2.589 ms
 9  119.147.219.241  5.912 ms  119.147.222.13  10.056 ms  119.147.220.41  12.080 ms
10  * * *
11  * * *
...

Listing 2: Traceroute from censored domain

The GFW has a list of blocked IP addresses and IP address ranges, which makes it infeasible for us to test all IP addresses. The command-line tool ping tests the reachability between two nodes at the network layer, but it does so by constructing ICMP packets, and some firewalls on the server side block ICMP packets for security reasons. The measurement scripts are therefore built on the command-line tool curl, which supports a set of application-layer protocols, e.g., FTP, HTTP, and HTTPS. In the experiment, the packets need to be sent to available ports, that is, ports that are open and not filtered by the firewall on the server side. In particular, port 80 (HTTP) is allowed by most servers. Therefore, HTTP is used to trigger the GFW. curl is a convenient tool since it returns both the result and a status code (e.g., the status code is 0 if the connection succeeds). Checking the status code makes it possible to detect network connection problems. For example, status codes 7 and 28 indicate that something in the path may be intercepting the normal traffic. Status code 7 means that curl failed to establish a TCP connection to the host. The reason can be a wrong port number, a wrong hostname, or a firewall in the path. Status code 28 means that the connection reached the specified timeout. In the script, the maximum time to connect to a website is ten seconds. The different situations behind status code 7 are distinguished by executing the scripts multiple times on clients both inside and outside the censored domain. The results show that the IP addresses of 498 valid domains are blocked by the GFW.
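A probe of this kind can be sketched as a thin wrapper around curl. This is not the script used in this work: the helper names and the category labels are hypothetical, and the use of a HEAD request (`--head`) is an assumption. The options `--connect-timeout` and `--header` and the exit codes 0, 7, 28, 35, and 56 are documented by curl.

```python
import subprocess

# Meaning of the curl exit codes that matter for the measurement.
CURL_EXIT_MEANING = {
    0: "ok",
    7: "connect_failed",   # could not establish a TCP connection
    28: "timeout",         # the timeout was reached
    35: "tls_error",       # TLS/SSL connection error
    56: "recv_failure",    # failure while receiving network data
}


def build_curl_command(ip: str, host: str, timeout_s: int = 10):
    """curl command that requests http://<ip>/ with an explicit Host header."""
    return [
        "curl", "--silent", "--head", "--output", "/dev/null",
        "--connect-timeout", str(timeout_s),
        "--header", f"Host: {host}",
        f"http://{ip}/",
    ]


def classify_exit_code(code: int) -> str:
    """Map a curl exit code to a coarse category label."""
    return CURL_EXIT_MEANING.get(code, "other")


def probe(ip: str, host: str, timeout_s: int = 10) -> str:
    """Run curl once and classify its exit code (requires curl to be installed)."""
    completed = subprocess.run(build_curl_command(ip, host, timeout_s))
    return classify_exit_code(completed.returncode)


# Example with a documentation address; the probe itself is not executed here.
print(" ".join(build_curl_command("203.0.113.7", "example.com")))
```

Since codes 7 and 28 can also have causes on the server side, a result from inside the censored domain only becomes meaningful when it is compared with the same probe run from an outside vantage point.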

3.3 TCP connection reset

TCP (Transmission Control Protocol) is a reliable communication protocol in the transport layer. It provides multiple mechanisms such as error detection, flow control, congestion control, and retransmission. However, it is not designed for transmitting data securely between two nodes, i.e., TCP provides neither confidentiality of the payload nor authentication of the two nodes’ identities. TCP connections can therefore be easily intercepted or tampered with by attackers. There are various approaches to interrupting a TCP connection, e.g., the TCP reset attack, the SYN flooding attack, and the TCP session hijacking attack. A simple way to break an existing connection between client and server is TCP connection reset. Two different situations that can trigger TCP connection resets are explored in this section.

[Figure 3.3: diagram. Recoverable labels: Client; GFW; TCP Segment; Keywords Detected; RST/ACK (twice); Censored Domain; Web Server]

Figure 3.3: TCP connection reset

curl return status code 56 means that the domain triggers a TCP reset attack. The exit code is 35 if there is a TLS/SSL connection error. For the pattern measurement experiment, only destination websites that are affected by TCP connection reset and have a single IP address are used, in order to minimize the impact of the Time to Live (TTL) field inside the IP header. In the typical situation, a single server always communicates with the client using the same TTL value. A disruption by the GFW can therefore be detected easily, since the GFW constructs packets with random TTL values.

3.3.1 HTTP keywords detection

According to the traffic analysis in Listing 3, when the client first connects to a blocked website, the GFW intercepts the connection after client and server have completed the three-way handshake and the client has pushed the sensitive request, in this case a request with mail-archive.com in the HTTP Host header field. The GFW disrupts the connection by sending one RST packet followed by three identical RST/ACK packets. The genuine acknowledgment from the server arrives later and is not accepted by the client, which answers it with an RST.


Listing 4 shows how the GFW behaves if the client connects to the website again within a certain period of time. After the client sends out the SYN packet, the GFW immediately impersonates the server and sends a SYN/ACK packet, which causes the client to send its subsequent packets in response to the GFW. The GFW can then send back an RST packet and an RST/ACK packet to notify the client that the connection is terminated. The SYN/ACK packet from the real server arrives after the client has already closed the connection. This result shows that the GFW is stateful, since it remembers the connection state of the client and server for a period of time. The state is kept for 90 to 95 seconds; when this time expires, the GFW discards the old connection state and again sends the reset packets only after the client and the blocked website have finished the three-way handshake.
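The observed stateful behavior can be modeled with a small table that expires its entries. This is only a model of the measurements described above, not the actual implementation of the GFW: keying the state on the (client, server) address pair, the hold time of 90 seconds, and the class and method names are assumptions.

```python
class ResetStateTable:
    """Toy model: remember a keyword hit per address pair for a limited time."""

    def __init__(self, hold_seconds: float = 90.0):
        self.hold_seconds = hold_seconds
        self._last_hit = {}  # (client_ip, server_ip) -> time of the keyword hit

    def record_keyword_hit(self, client: str, server: str, now: float) -> None:
        """Store the time at which a sensitive request was detected."""
        self._last_hit[(client, server)] = now

    def on_syn(self, client: str, server: str, now: float) -> str:
        """Decide how a new SYN between the two addresses is handled."""
        key = (client, server)
        hit = self._last_hit.get(key)
        if hit is not None and now - hit <= self.hold_seconds:
            # State still remembered: answer with a forged SYN/ACK at once.
            return "forge_syn_ack"
        # State expired or never recorded: fall back to keyword inspection.
        self._last_hit.pop(key, None)
        return "inspect_after_handshake"


table = ResetStateTable()
table.record_keyword_hit("192.168.1.117", "72.52.77.8", now=0.0)
print(table.on_syn("192.168.1.117", "72.52.77.8", now=30.0))
print(table.on_syn("192.168.1.117", "72.52.77.8", now=100.0))
```

Whether a repeated connection attempt refreshes the timer is not settled by the measurement, so the model leaves the recorded time unchanged.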

00:00:00.000000 IP (tos 0x0, ttl 64, id 38830, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.1.117.49576 > 72.52.77.8.http: Flags [S], cksum 0x5788 (incorrect -> 0xd21f), seq 2636432921, win 29200, options [mss 1460,sackOK,TS val 898126263 ecr 0,nop,wscale 7], length 0
00:00:00.167822 IP (tos 0x14, ttl 49, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    72.52.77.8.http > 192.168.1.117.49576: Flags [S.], cksum 0xce77 (correct), seq 2021363056, ack 2636432922, win 28960, options [mss 1460,sackOK,TS val 4038397413 ecr 898126263,nop,wscale 7], length 0
00:00:00.167858 IP (tos 0x0, ttl 64, id 38831, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.1.117.49576 > 72.52.77.8.http: Flags [.], cksum 0x5780 (incorrect -> 0x6cd7), ack 1, win 229, options [nop,nop,TS val 898126431 ecr 4038397413], length 0
00:00:00.167955 IP (tos 0x0, ttl 64, id 38832, offset 0, flags [DF], proto TCP (6), length 133)
    192.168.1.117.49576 > 72.52.77.8.http: Flags [P.], cksum 0x57d1 (incorrect -> 0x3c39), seq 1:82, ack 1, win 229, options [nop,nop,TS val 898126431 ecr 4038397413], length 81: HTTP, length: 81
        HEAD / HTTP/1.1
        User-Agent: curl/7.29.0
        Host: mail-archive.com
        Accept: */*

00:00:00.173402 IP (tos 0x14, ttl 61, id 0, offset 0, flags [none], proto TCP (6), length 40)
    72.52.77.8.http > 192.168.1.117.49576: Flags [R], cksum 0x5bff (correct), seq 2021363057, win 13474, length 0
00:00:00.174417 IP (tos 0x14, ttl 97, id 6299, offset 0, flags [DF], proto TCP (6), length 40)
    72.52.77.8.http > 192.168.1.117.49576: Flags [R.], cksum 0x19f9 (correct), seq 1, ack 82, win 4872, length 0
00:00:00.174436 IP (tos 0x14, ttl 97, id 6299, offset 0, flags [DF], proto TCP (6), length 40)
    72.52.77.8.http > 192.168.1.117.49576: Flags [R.], cksum 0x19f9 (correct), seq 1, ack 82, win 4872, length 0
00:00:00.174458 IP (tos 0x14, ttl 97, id 6299, offset 0, flags [DF], proto TCP (6), length 40)
    72.52.77.8.http > 192.168.1.117.49576: Flags [R.], cksum 0x19f9 (correct), seq 1, ack 82, win 4872, length 0
00:00:00.335710 IP (tos 0x14, ttl 49, id 55300, offset 0, flags [DF], proto TCP (6), length 52)
    72.52.77.8.http > 192.168.1.117.49576: Flags [.], cksum 0x6be0 (correct), ack 82, win 227, options [nop,nop,TS val 4038397581 ecr 898126431], length 0
00:00:00.335734 IP (tos 0x14, ttl 64, id 18607, offset 0, flags [DF], proto TCP (6), length 40)
    192.168.1.117.49576 > 72.52.77.8.http: Flags [R], cksum 0x32fe (correct), seq 2636433003, win 0, length 0

This listing shows that the GFW sends one RST packet followed by three RST/ACK packets to terminate the connection between client and server after detecting the keywords inside the Host field and URL. The bogus packets sent by the GFW (the RST at 00:00:00.173402 and the three RST/ACK packets that follow it) are colored red in the original listing.

Listing 3: Packet capture example of TCP connection reset over HTTP protocol
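The TTL values in Listing 3 can be used to separate forged packets from genuine ones. The following minimal sketch takes the packets that claim to come from the server, transcribed from Listing 3, and flags those whose TTL differs from the TTL of the SYN/ACK of the handshake. Using the first SYN/ACK as the baseline is an assumption that holds for Listing 3 only; it fails when the SYN/ACK itself is forged.

```python
# (timestamp, ttl, tcp_flags) of the packets with source 72.52.77.8,
# transcribed from Listing 3.
SERVER_PACKETS = [
    ("00:00:00.167822", 49, "S."),
    ("00:00:00.173402", 61, "R"),
    ("00:00:00.174417", 97, "R."),
    ("00:00:00.174436", 97, "R."),
    ("00:00:00.174458", 97, "R."),
    ("00:00:00.335710", 49, "."),
]


def flag_ttl_outliers(packets, tolerance: int = 0):
    """Return the packets whose TTL deviates from that of the first packet."""
    if not packets:
        return []
    baseline_ttl = packets[0][1]  # TTL of the SYN/ACK from the handshake
    return [p for p in packets if abs(p[1] - baseline_ttl) > tolerance]


suspicious = flag_ttl_outliers(SERVER_PACKETS)
for timestamp, ttl, flags in suspicious:
    print(timestamp, ttl, flags)
```

For this capture, the flagged packets are exactly the RST and the three RST/ACK packets, while the late acknowledgment with TTL 49 is treated as genuine.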

00:00:03.732147 IP (tos 0x0, ttl 64, id 43057, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.1.117.48636 > 72.52.77.8.http: Flags [S], cksum 0x5788 (incorrect -> 0xb244), seq 2201214804, win 29200, options [mss 1460,sackOK,TS val 873953125 ecr 0,nop,wscale 7], length 0
00:00:03.739782 IP (tos 0x14, ttl 67, id 36317, offset 0, flags [DF], proto TCP (6), length 40)
    72.52.77.8.http > 192.168.1.117.48636: Flags [S.], cksum 0x4ef6 (correct), seq 41278637, ack 2201214805, win 2442, length 0
00:00:03.739808 IP (tos 0x0, ttl 64, id 43058, offset 0, flags [DF], proto TCP (6), length 40)
    192.168.1.117.48636 > 72.52.77.8.http: Flags [.], cksum 0x5774 (incorrect -> 0xe670), ack 1, win 29200, length 0
00:00:03.739961 IP (tos 0x0, ttl 64, id 43059, offset 0, flags [DF], proto TCP (6), length 121)
    192.168.1.117.48636 > 72.52.77.8.http: Flags [P.], cksum 0x57c5 (incorrect -> 0xb5d2), seq 1:82, ack 1, win 29200, length 81: HTTP, length: 81
        HEAD / HTTP/1.1
        User-Agent: curl/7.29.0
        Host: mail-archive.com
        Accept: */*
00:00:03.745413 IP (tos 0x14, ttl 54, id 0, offset 0, flags [none], proto TCP (6), length 40)
    72.52.77.8.http > 192.168.1.117.48636: Flags [R], cksum 0x76c0 (correct), seq 41278638, win 17494, length 0
00:00:03.746091 IP (tos 0x14, ttl 68, id 35307, offset 0, flags [DF], proto TCP (6), length 40)
    72.52.77.8.http > 192.168.1.117.48636: Flags [R.], cksum 0x4ef2 (correct), seq 1, ack 1, win 2443, length 0
00:00:03.893600 IP (tos 0x14, ttl 49, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    72.52.77.8.http > 192.168.1.117.48636: Flags [S.], cksum 0x0e1b (correct), seq 521495779, ack 2201214805, win 28960, options [mss 1460,sackOK,TS val 4014223819 ecr 873953125,nop,wscale 7], length 0
00:00:03.893643 IP (tos 0x14, ttl 64, id 45705, offset 0, flags [DF], proto TCP (6), length 40)
    192.168.1.117.48636 > 72.52.77.8.http: Flags [R], cksum 0x37b1 (correct), seq 2201214805, win 0, length 0

The GFW keeps the state of each TCP connection for between 90 and 95 seconds. During this period, every SYN packet sent by the client is answered with a fake SYN/ACK packet. The real and the forged sender can be distinguished by the TTL value inside the IP header. The GFW hijacks the three-way handshake and sends an RST packet followed by an RST/ACK packet to terminate the connection. The SYN/ACK sent by the real server arrives later but is then ignored by the client. The bogus packets sent by the GFW are colored in red.

Listing 4: Packet capture example of TCP connection reset over HTTP protocol during a certain period

3.3.2 DNS keywords detection

DNS (Domain Name System) is a hierarchical, decentralized system that maps domain names to IP addresses. Querying a DNS server is the first step after a user enters a domain name in the address bar and hits enter. The DNS protocol supports communication over both TCP and UDP, but it does not provide confidentiality, since the payload is sent in plain text. Because of this, the GFW also contains a module that triggers TCP connection resets targeting the DNS protocol.

Two Virtual Private Servers (VPS), one inside and one outside the censored domain, are deployed to test this mechanism. Two DNS servers are chosen: Google Public DNS, which is located outside the censored domain, and Ali DNS, which is located inside the censored domain. The packet capturing tool tcpdump is used to monitor network traffic to port 53, and the DNS lookup tool dig is used to construct DNS queries over TCP.

The DNS query is sent over TCP to Google Public DNS at 8.8.8.8, and the output shows where the connection is reset. The network traffic shows that, after the normal three-way handshake, the host issues a DNS query containing A? www.google.com to the other side. Some device in the middle then pretends to be 8.8.8.8 and immediately returns three RST/ACK packets, which acknowledge the data sent so far and notify the host that the connection is reset, as shown in Listing 5.


The TCP connection is then closed by the host, which means no process is listening on the previous port any more. The genuine ACK and data packets carrying the correctly resolved address arrive later from 8.8.8.8, are ignored by the operating system, and are answered by the host with RST packets. One way to distinguish the correct DNS server from a fake one is to check the Time to Live (TTL) field. In this case, packets from the correct DNS server have a TTL value of 39 while those from the imposter have 157; a TTL value of 64 means the packet was sent by the host.

Listing 5 reveals how the GFW performs the reset attack. It is similar to the TCP connection reset for HTTP keywords detection. The difference is that HTTP detection is stateful, since the GFW stores the connection state in its database for a period of time. For the DNS protocol, in contrast, if the same domain name is queried repeatedly, the GFW always returns three RST/ACK packets to terminate the connection. This shows that for DNS over TCP, the GFW does not record the TCP connection state.
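The TTL check described above can be sketched in a few lines of Python. This is a minimal illustration, not the measurement tooling used in this work: the `Packet` record and the `flag_suspicious` helper are hypothetical names, the baseline TTL of 39 is simply the value observed for 8.8.8.8 in Listing 5, and the sketch assumes a genuine reply has already been identified (which does not hold when the forged packet arrives first, as in Listing 4).

```python
from typing import List, NamedTuple


class Packet(NamedTuple):
    """Hypothetical record holding the fields read off a tcpdump line."""
    time: float   # seconds since the first packet
    src: str      # claimed source address
    ttl: int      # TTL value in the IP header
    flags: str    # TCP flags as printed by tcpdump


def flag_suspicious(packets: List[Packet], server: str,
                    baseline_ttl: int, tolerance: int = 2) -> List[Packet]:
    """Return packets that claim to come from `server` but whose TTL
    differs from the expected baseline by more than `tolerance` hops."""
    return [p for p in packets
            if p.src == server and abs(p.ttl - baseline_ttl) > tolerance]


# Packets claiming to come from 8.8.8.8 in Listing 5.
capture = [
    Packet(0.023463, "8.8.8.8", 39, "S."),
    Packet(0.030778, "8.8.8.8", 157, "R."),
    Packet(0.030781, "8.8.8.8", 157, "R."),
    Packet(0.030817, "8.8.8.8", 157, "R."),
    Packet(0.047164, "8.8.8.8", 39, "."),
    Packet(0.047757, "8.8.8.8", 39, "P."),
]

forged = flag_suspicious(capture, server="8.8.8.8", baseline_ttl=39)
print([(p.time, p.ttl, p.flags) for p in forged])
```

The tolerance parameter is an assumption that allows for small route changes; a forger that matches the server's TTL would not be caught by this check.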

00:00:00.000000 IP (tos 0x0, ttl 64, id 15284, offset 0, flags [DF], proto TCP (6), length 60)
    192.168.1.117.52458 > 8.8.8.8.domain: Flags [S], cksum 0xd25b (incorrect -> 0xfd94), seq 2016075502, win 29200, options [mss 1460,sackOK,TS val 4237944154 ecr 0,nop,wscale 7], length 0
00:00:00.023463 IP (tos 0x14, ttl 39, id 10393, offset 0, flags [none], proto TCP (6), length 60)
    8.8.8.8.domain > 192.168.1.117.52458: Flags [S.], cksum 0x6bfc (correct), seq 3781683001, ack 2016075503, win 60192, options [mss 1380,sackOK,TS val 1040060966 ecr 4237944154,nop,wscale 8], length 0
00:00:00.023505 IP (tos 0x0, ttl 64, id 15285, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.1.117.52458 > 8.8.8.8.domain: Flags [.], cksum 0xd253 (incorrect -> 0x849e), ack 1, win 229, options [nop,nop,TS val 4237944177 ecr 1040060966], length 0
00:00:00.023721 IP (tos 0x0, ttl 64, id 15286, offset 0, flags [DF], proto TCP (6), length 97)
    192.168.1.117.52458 > 8.8.8.8.domain: Flags [P.], cksum 0xd280 (incorrect -> 0xa472), seq 1:46, ack 1, win 229, options [nop,nop,TS val 4237944177 ecr 1040060966], length 45 10227+ [1au] A? www.google.com. (43)
00:00:00.030778 IP (tos 0x14, ttl 157, id 19375, offset 0, flags [DF], proto TCP (6), length 40)
    8.8.8.8.domain > 192.168.1.117.52458: Flags [R.], cksum 0xe209 (correct), seq 1, ack 46, win 3728, length 0
00:00:00.030781 IP (tos 0x14, ttl 157, id 19375, offset 0, flags [DF], proto TCP (6), length 40)
    8.8.8.8.domain > 192.168.1.117.52458: Flags [R.], cksum 0xe209 (correct), seq 1, ack 46, win 3728, length 0
00:00:00.030817 IP (tos 0x14, ttl 157, id 19375, offset 0, flags [DF], proto TCP (6), length 40)
    8.8.8.8.domain > 192.168.1.117.52458: Flags [R.], cksum 0xe209 (correct), seq 1, ack 46, win 3728, length 0
00:00:00.047164 IP (tos 0x14, ttl 39, id 10403, offset 0, flags [none], proto TCP (6), length 52)
    8.8.8.8.domain > 192.168.1.117.52458: Flags [.], cksum 0x8453 (correct), ack 46, win 236, options [nop,nop,TS val 1040060989 ecr 4237944177], length 0
00:00:00.047196 IP (tos 0x14, ttl 64, id 37215, offset 0, flags [DF], proto TCP (6), length 40)
    192.168.1.117.52458 > 8.8.8.8.domain: Flags [R], cksum 0xb94c (correct), seq 2016075548, win 0, length 0
00:00:00.047757 IP (tos 0x14, ttl 39, id 10404, offset 0, flags [none], proto TCP (6), length 113)
    8.8.8.8.domain > 192.168.1.117.52458: Flags [P.], cksum 0x9a74 (correct), seq 1:62, ack 46, win 236, options [nop,nop,TS val 1040060990 ecr 4237944177], length 61 10227 1/0/1 www.google.com. A 172.217.27.132 (59)
00:00:00.047764 IP (tos 0x14, ttl 64, id 37216, offset 0, flags [DF], proto TCP (6), length 40)
    192.168.1.117.52458 > 8.8.8.8.domain: Flags [R], cksum 0xb94c (correct), seq 2016075548, win 0, length 0

The bogus packets sent by the GFW are colored in red.

Listing 5: Packet capture example over TCP


3.3.3 TCP connection reset module

Figure 3.4 presents how a basic TCP connection reset device [28] works. It contains several pluggable modules. The capturing module collects all TCP packets from real-time network traffic. The update module updates the connection information stored in the database. Meanwhile, the decision module decides whether an intrusion prevention rule applies; if so, the blocking module computes and constructs fake packets and sends them to the targets.

The behavior of the GFW fits this model. The GFW sits on the path of the international gateway and makes a copy of network packets. The TCP packet capturing module is responsible for capturing TCP packets, after which the GFW collects connection information including the source Media Access Control (MAC) address, IP address, and TCP port, as well as the destination MAC address, IP address, and TCP port. In addition, the GFW collects the sequence and acknowledgment numbers from the request and response packets of both sides in order to construct a valid packet. Once a client connects to the server, the TCP connection information is collected or updated and then stored in the database for a certain period. Censored keywords in a client request trigger the blocking module, which issues the command to construct a fake packet according to the previously updated connection state, e.g., by computing the sequence and acknowledgment numbers and swapping the source and destination addresses. The fake RST or RST/ACK packets are sent to both the client and the server in order to break the connection on both sides.

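[Figure 3.4: flowchart of TCP connection reset. Network traffic enters TCP packet capturing, which leads to Update connection state and Connection state storage; connection info is passed to the Filter rules, whose outcome is either Not block (Do nothing) or Block (Construct packets, then Send fake RST packets).]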

Figure 3.4: The model of TCP connection reset device
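The bookkeeping behind the fake packet can be illustrated with plain data structures. The sketch below is an assumption-laden model of the blocking module, not the GFW's actual implementation: `Segment` and `forge_rst_ack` are hypothetical names, and no packet is built or sent. It only shows how swapping the endpoints and reusing the observed sequence and acknowledgment numbers yields header values matching the forged RST/ACK in Listing 5 (seq 1, ack 46, in tcpdump's relative numbering).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical view of the TCP/IP header fields the device tracks."""
    src: str
    sport: int
    dst: str
    dport: int
    seq: int          # relative sequence number of the first payload byte
    ack: int          # relative acknowledgment number
    payload_len: int
    flags: str


def forge_rst_ack(observed: Segment) -> Segment:
    """Derive the header of a reset that appears to come from the peer.

    Source and destination are swapped, the sequence number is the one the
    sender expects next from its peer (observed.ack), and the
    acknowledgment number covers the payload that was just observed.
    """
    return Segment(
        src=observed.dst, sport=observed.dport,
        dst=observed.src, dport=observed.sport,
        seq=observed.ack,
        ack=observed.seq + observed.payload_len,
        payload_len=0,
        flags="R.",
    )


# The DNS query segment from Listing 5: seq 1:46, ack 1, length 45.
query = Segment("192.168.1.117", 52458, "8.8.8.8", 53,
                seq=1, ack=1, payload_len=45, flags="P.")
reset = forge_rst_ack(query)
print(reset.src, reset.seq, reset.ack, reset.flags)
```

Because both numbers fall inside the window the client expects, the client's TCP stack has no header-level reason to reject the reset, which is why the TTL is the more useful signal.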


3.4 DNS hijacking and DNS cache poisoning

3.4.1 How the GFW intercepts DNS resolution

This section describes the experiment on how the GFW performs DNS hijacking over the UDP protocol, as well as DNS poisoning, which is a side effect of DNS hijacking. DNS hijacking takes advantage of the properties of UDP: the protocol is unreliable, and senders send out packets without requiring acknowledgments. This allows attackers to send forged DNS replies. The client accepts the injected replies since they arrive earlier than the legitimate ones.

When DNS queries are sent over UDP, the output of dig does not show any error message. Instead, it returns a type A record with an IPv4 address, as shown in Listing 6. This address, however, is invalid because it does not point to the real website. Traffic analysis shows that after the host sends a DNS query containing blocked keywords, some device disguises itself as the destination DNS server and immediately sends two packets with fake resource records. The correct response arrives a few milliseconds later and is ignored by dig, which has already used the first forged answer. Figure 3.5 shows how DNS hijacking works for the TCP and UDP protocols.

[Figure 3.5: message diagrams between Client, GFW, and Google Public DNS, with the client inside the censored domain. Over TCP: 1. DNS Query, 2. DNS Query, 3. RST/ACK, 4. DNS Response. Over UDP: 1. DNS Query, 2. DNS Query, 3. Wrong Response, 4. DNS Response.]

Figure 3.5: DNS hijacking

00:00:00.000000 IP (tos 0x0, ttl 64, id 26525, offset 0, flags [none], proto UDP (17), length 71)
    192.168.1.117.49964 > 8.8.8.8.domain: 14538+ [1au] A? www.google.com. (43)
00:00:00.005060 IP (tos 0x14, ttl 44, id 0, offset 0, flags [none], proto UDP (17), length 76)
    8.8.8.8.domain > 192.168.1.117.49964: 14538 1/0/0 www.google.com. A 31.13.85.16 (48)
00:00:00.008538 IP (tos 0x14, ttl 103, id 29352, offset 0, flags [DF], proto UDP (17), length 76)
    8.8.8.8.domain > 192.168.1.117.49964: 14538 1/0/0 www.google.com. A 31.13.80.17 (48)
00:00:00.022273 IP (tos 0x14, ttl 39, id 29402, offset 0, flags [none], proto UDP (17), length 87)
    8.8.8.8.domain > 192.168.1.117.49964: 14538 1/0/1 www.google.com. A 216.58.197.100 (59)

The bogus packets sent by the GFW are colored in red.

Listing 6: Packet capture example over UDP
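The race shown in Listing 6 can be replayed offline. The following sketch assumes a hypothetical `Answer` record filled with the three responses from the listing. `first_answer` models a stub resolver that accepts whatever arrives first, while `conflicting_answers` models one possible client-side heuristic: waiting briefly and checking whether several different answers arrive for the same query ID. The heuristic is an illustration only and is not part of the experiment described here.

```python
from typing import List, NamedTuple, Set


class Answer(NamedTuple):
    """Hypothetical record for one DNS response seen on the wire."""
    time: float      # arrival time in seconds
    query_id: int    # 16-bit DNS transaction ID
    address: str     # IPv4 address in the A record
    ttl: int         # TTL of the IP packet, not of the DNS record


def first_answer(answers: List[Answer]) -> Answer:
    """What a stub resolver over UDP effectively uses: the earliest reply."""
    return min(answers, key=lambda a: a.time)


def conflicting_answers(answers: List[Answer], query_id: int) -> Set[str]:
    """Distinct addresses returned for one query; more than one is a hint
    that some replies were injected."""
    return {a.address for a in answers if a.query_id == query_id}


# The three responses to query 14538 from Listing 6.
replies = [
    Answer(0.005060, 14538, "31.13.85.16", 44),
    Answer(0.008538, 14538, "31.13.80.17", 103),
    Answer(0.022273, 14538, "216.58.197.100", 39),
]

print(first_answer(replies).address)
print(sorted(conflicting_answers(replies, 14538)))
```

Note that all three replies carry the same transaction ID, so an ID check alone cannot separate the forged answers from the genuine one.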

In general, DNS cache poisoning is more efficient than DNS hijacking, since attackers only need to attack the target once and the fake resource records then stay in the DNS cache for one or two days. To test DNS poisoning, the DNS server is changed to the Ali DNS servers (223.5.5.5 and 223.6.6.6) inside the censored domain, and a DNS query is then sent over TCP and UDP. According to the output of tcpdump, the connection was not intercepted by a third party, but both resource records were incorrect. The reason is that when the Ali DNS server performs the recursive DNS query (e.g., sends a query to a DNS root server), some devices in between intercept the connection and construct fake packets, which results in an invalid resource record being stored in the Ali DNS cache. Since the GFW can read the content of the packets, the attack is easy to perform: the GFW does not need to guess the 16-bit ID in DNS packets. It copies the ID from the query, uses the destination DNS server’s address as the source address of the forged response, and sends it back to the user. Figure 3.6 illustrates DNS cache poisoning over the TCP and UDP protocols.

[Figure 3.6: message diagrams between Client, GFW, Ali DNS, and Root DNS, with the client and Ali DNS inside the censored domain. Over TCP: 1. DNS Query, 2. DNS Query, 3. RST/ACK, 4. Wrong Response. Over UDP: 1. DNS Query, 2. DNS Query, 3. DNS Query, 4. Wrong Response, 5. DNS Response, 6. Wrong Response. The forged DNS reply ends up in the poisoned cache.]

Figure 3.6: DNS cache poisoning

When the same experimental steps are performed from outside the censored domain, the GFW’s behavior is the same, except that all the resource records returned are incorrect due to DNS cache poisoning.

According to the test results, the GFW uses DPI to check for keywords inside DNS packets at the application layer. If a packet contains blacklisted keywords, the GFW triggers the corresponding rules. The GFW applies different filtering rules depending on the underlying transport protocol: it either sends RST/ACK packets to interrupt a TCP connection or sends a fake response over UDP. Based on the results obtained from outside the censored domain, the filtering rules appear to be bidirectional, applying to both inbound and outbound network traffic.

The response constructed by the GFW over UDP is straightforward: regardless of which record type is queried, e.g., Name Server (NS), Start of Authority (SOA), or Mail Exchanger (MX), the resolution result is always a type A record with a forged IP address. The poisoned domain names are collected in order to geolocate the forged addresses. Each of them is queried 101 times against the Google Public DNS server over UDP in order to trigger the GFW and reveal the pattern behind the forged addresses. In total, 770 unique fake IP addresses are collected. Table 3.1 shows the AS number, owner, and number of occurrences for these poisoned IP addresses. Most bogus IP addresses come from the IP blocking list. The results indicate that if users do not take active measures against DNS hijacking and poisoning, the effect is the same as IP blocking: packets will not be forwarded to the destination.


AS Number  Owner                        Country  IP Occurrence  Example CIDR Range
AS32934    Facebook, Inc.               US       456 (41.1%)    31.13.64.0/19, 69.63.176.0/20, 173.252.64.0/19, 66.220.144.0/20, 69.171.224.0/20
AS40824    WZ Communications Inc.       US       116 (10.5%)    204.155.144.0/20, 74.117.176.0/21, 199.101.132.0/22
AS36351    SoftLayer Technologies Inc.  US       67 (6.0%)      174.36.0.0/15, 67.15.0.0/16, 74.86.0.0/16, 67.228.0.0/16, 208.101.0.0/18
AS19679    Dropbox, Inc.                US       64 (5.8%)      108.160.160.0/20
AS13414    Twitter Inc.                 US       63 (5.7%)      199.96.60.0/22, 199.16.156.0/22, 199.59.148.0/22
AS10753    Level 3 Parent, LLC          US       57 (5.1%)      69.63.176.0/20, 66.220.144.0/20
AS15169    Google LLC                   US       39 (3.5%)      74.125.0.0/16, 172.217.0.0/16, 209.85.128.0/17, 216.58.192.0/19
AS11172    Alestra, S. de R.L. de C.V.  MX       38 (3.4%)      74.86.0.0/16, 208.101.0.0/18, 174.36.192.0/18, 67.228.96.0/19, 75.126.112.0/20
AS22773    Cox Communications Inc.      US       36 (3.2%)      74.86.0.0/16, 174.36.192.0/18, 67.228.192.0/19
AS22561    CENTURYLINK, INC.            US       30 (2.7%)      66.220.144.0/20
AS2914     NTT America, Inc.            US       30 (2.7%)      128.242.0.0/16, 168.143.0.0/16, 124.40.32.0/19
AS19281    Quad9                        US       13 (1.2%)      216.58.192.0/19
AS13767    DataBank Holdings, Ltd.      US       7 (0.6%)       72.233.0.0/19, 72.232.168.0/22
AS22576    DataPipe, Inc.               US       7 (0.6%)       72.233.0.0/19, 72.232.168.0/22
AS31815    Media Temple, Inc.           US       7 (0.6%)       64.13.192.0/18

Table 3.1: Poisoned IP addresses
The IP addresses returned in bogus DNS packets sent by the GFW are blocked in the network layer.
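Matching a forged address against the ranges in Table 3.1 is a prefix lookup. The sketch below uses Python's standard `ipaddress` module with a small excerpt of the table; the `RANGES` dictionary and the `owners_of` helper are hypothetical names, and because several ranges in the table are listed under more than one AS, the helper simply returns every matching owner.

```python
import ipaddress
from typing import Dict, List

# Excerpt of Table 3.1 (owner -> example CIDR ranges).
RANGES: Dict[str, List[str]] = {
    "Facebook, Inc.": ["31.13.64.0/19", "69.63.176.0/20", "173.252.64.0/19"],
    "Google LLC": ["74.125.0.0/16", "172.217.0.0/16", "216.58.192.0/19"],
    "Quad9": ["216.58.192.0/19"],
    "Dropbox, Inc.": ["108.160.160.0/20"],
}


def owners_of(address: str) -> List[str]:
    """Return every owner whose listed ranges contain `address`."""
    ip = ipaddress.ip_address(address)
    return [owner for owner, cidrs in RANGES.items()
            if any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)]


# Addresses taken from the three responses in Listing 6.
for addr in ("31.13.85.16", "31.13.80.17", "216.58.197.100"):
    print(addr, owners_of(addr))
```

With this excerpt, the two forged answers from Listing 6 fall into a range listed for Facebook, while the genuine answer falls into a range listed for Google.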


Poisoned Domains            1435
Correctly Resolved Domains  4265
Invalid Domains              470
Total                       6170

Table 3.2: The proportion of poisoned domain names

3.5 Experimental results and summary

3.5.1 Threat model

The experiments reveal that the GFW has both in-path and on-path devices. The in-path devices focus on dropping or filtering packets, while the on-path devices can observe packets on the fly but do not directly tamper with them. The detection methods include keyword detection: any suspicious network traffic triggers the predefined rules. These devices then construct forged packets to disrupt normal connections between client and server.

• The GFW does not aim to decrypt data between server and client. The goal of the GFW is to intercept, terminate, or block connections.

• The GFW can block packets in the network layer with its in-path devices.

• The GFW can monitor the state of a connection in the application layer and look inside the packets.

• The GFW has multiple intrusion detection modules, each with an intrusion prevention rule module, e.g., detecting keywords in HTTP headers triggers three RST/ACK packets.

3.5.2 Results and Circumvention suggestions

The measurement results are listed in Table 3.2 and Table 3.3. In general, TCP connection reset accounts for a large part of the GFW list: 139 domains are affected by both DNS poisoning and IP blocking, while 923 domains are affected by both DNS poisoning and TCP connection reset. Compared against the Alexa top 100 global sites, 36 of the 100 websites are blocked by IP address, and most of them are also affected by DNS poisoning. These websites include popular search engines, e.g., Google, video-sharing websites, e.g., YouTube and Dailymotion, social networks, e.g., Facebook and Twitter, as well as cloud storage such as Dropbox. Most unpopular or unknown websites are affected only by DNS poisoning or TCP connection reset.
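The counts in Table 3.2 and Table 3.3 can be turned into proportions with a few lines of arithmetic. The snippet below only restates the numbers from the two tables; the variable names and the `share` helper are introduced here for illustration.

```python
# Counts copied from Table 3.2 and Table 3.3.
poisoned, resolved, invalid = 1435, 4265, 470
total = poisoned + resolved + invalid

poisoned_ip_blocked, poisoned_tcp_reset = 139, 923
resolved_ip_blocked, resolved_tcp_reset = 498, 2219


def share(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


print("total domains:", total)
print("poisoned:", share(poisoned, total), "%")
print("correctly resolved:", share(resolved, total), "%")
print("invalid:", share(invalid, total), "%")
print("TCP reset among poisoned:", share(poisoned_tcp_reset, poisoned), "%")
print("TCP reset among resolved:", share(resolved_tcp_reset, resolved), "%")
```

The three categories add up to the 6170 domains tested, of which roughly 23% are poisoned.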

                      Poisoned Domains  Correctly Resolved Domains
IP Blocking                        139                         498
TCP Connection Reset               923                        2219

Table 3.3: The number of domains affected by each blocking mechanism

The results show that circumventing the GFW by directly attacking its on-path devices, e.g., by desynchronizing the connection state inside its database, is less effective, since most popular websites are blocked at the network layer. The reliable way is to forward the traffic via IP addresses that cannot be blocked by the GFW, for instance, by using a Virtual Private Network (VPN) to encrypt the whole IP payload, or by using a proxy outside the censored domain.

According to the measurements of TCP connection reset, the GFW uses DPI to detect whether packets contain keywords, for example in the Host field of HTTP headers or the Question field of DNS request packets, or whether they show an easily detectable pattern, e.g., the certificate exchange in the TLS handshake protocol. These intrusion detection rules indicate that the payload should appear meaningless to the GFW; one countermeasure is encryption or randomization, as mentioned in the previous chapter. In addition, the handshake phase and the data transmission phase must not contain distinctive features. Tunneling the data over other permitted protocols is one way to achieve this.


Chapter 4

Circumvention System Design

This chapter introduces a peer-to-peer based proxy system called MultiProxy. Section 4.1 describes the design goals of the system. Section 4.2 gives a brief overview of the system architecture, which contains three major modules that are described in the following three sections. The traffic forwarding module, which provides the basic traffic forwarding and circumvention service, is explained in Section 4.3, including the SOCKS5 and TCP protocols. Next, the threats to the MultiProxy system are analyzed in Section 4.4: with the token economy implemented by the Trustchain protocol and Intel SGX technology, a robust network is built with a balanced number of service consumers and providers. In order to provide privacy for request originators, the system also uses anonymous end-to-end communication, which is explained in Section 4.5. Finally, the implementation details of the token economy and anonymous routing are described in Section 4.6.

4.1 Design goals

In recent years, various methods have been developed to circumvent censorship, such as VPNs and SSH tunnels. Shadowsocks, an open source encrypted proxy, has been widely used in mainland China to circumvent Internet censorship. The project has gained 26753 stars and 16626 forks on GitHub since 2012. The software has been ported to different platforms and operating systems. The idea behind Shadowsocks is to split a SOCKS proxy into two parts: the client inside the censored domain and the server outside the censored domain. Various encryption methods are supported to protect the data on the path between the client and the server, e.g., AES, RC4-MD5, Salsa20, and ChaCha20. Shadowsocks works efficiently for single users because of its lightweight design and good obfuscation. In the regular use case, users connect their Shadowsocks client to their own Shadowsocks server outside the censored domain.


Region  IP address
Asia    117.18.232.200
Europe  152.199.19.160
US      72.21.81.200

Table 4.1: The IP addresses of meek server

However, the system also has some drawbacks. In its current implementation, Shadowsocks is based on a client-server architecture, which means that one or multiple clients share the same server. This makes the server a single point of failure. Although the encryption is good enough to obfuscate the protocol against the DPI technology of the GFW, the GFW could still integrate more advanced methods to detect the traffic, e.g., machine learning [8] and entropy tests [36]. Once the distinctive network traffic is detected, the Shadowsocks server can easily be discovered and blocked by the GFW. Users then have no choice other than to set up a new Shadowsocks server. In addition, bandwidth is limited in this architecture, because a single user can usually only make use of the bandwidth of one server, given the server-side cost.
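The intuition behind an entropy test can be shown with a short Python sketch. It is not the classifier from [36]; it only illustrates the general idea that a fully encrypted payload looks close to uniformly random, while a plaintext protocol does not. The `shannon_entropy` helper is a hypothetical name, and `os.urandom` stands in for a well-encrypted payload.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# A plaintext HTTP request, repeated to get a reasonably long sample.
plaintext = (b"HEAD / HTTP/1.1\r\nUser-Agent: curl/7.29.0\r\n"
             b"Host: mail-archive.com\r\nAccept: */*\r\n\r\n") * 50
random_like = os.urandom(len(plaintext))  # stand-in for encrypted payload

print(round(shannon_entropy(plaintext), 2))
print(round(shannon_entropy(random_like), 2))
```

A censor could flag flows whose first packets have an entropy close to 8 bits per byte and no recognizable protocol header, which is why high entropy alone does not make a protocol undetectable.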

The Tor project, which is built on a distributed and anonymous network, developed a new pluggable transport, meek [12], in 2014. This pluggable transport uses the domain fronting technique as its obfuscation layer to evade censorship. The forbidden network traffic is relayed to a forwarder called the meek server on an unblocked CDN. Meek is now the only way for Tor clients to access blocked websites from China. Figure 4.1 shows a Tor circuit established using meek. The first hop is the bridge with the meek server, called “cymrubridge02”. This bridge is shared by all Tor browsers that use the meek client. Figure 4.2 shows how its number of clients varied over the last year; it increased from 5,000 to 10,000. The IP addresses behind the meek server “meek.azureedge.net” are shown in Table 4.1; the results are obtained by querying the domain name from locations around the world. Although a server is capable of handling up to 10,000 clients at the same time, it can still be the performance bottleneck.

Figure 4.1: A Tor circuit with pluggable transport meek

Figure 4.2: The average number of connected clients of cymrubridge02 [32]

In order to overcome the drawbacks of both the existing client-server based and distributed circumvention systems, a decentralized censorship resistance system called MultiProxy is described in this chapter. MultiProxy relies on the concept of a peer-to-peer network and lets users inside the censored domain make use of multiple proxies at one time. It also proposes a token economy as an incentive mechanism to balance service consumers and providers. The detailed system architecture and components are presented in Section 4.2.

4.2 System Architecture

[Figure 4.3: component diagram with the blocks SOCKS5 on edge, Trustchain, Peer discovery, Token economy, Intel SGX, Multi-hop routing, and Data encryption.]

Figure 4.3: Circumvention system architecture

Figure 4.3 shows the main components and protocol stack of the MultiProxy system, including the SOCKS5 protocol on the edge, the token economy based on the Trustchain protocol and Intel SGX technology, onion routing, and the peer discovery protocol. All functionality is built on top of the peer-to-peer system. The main components are introduced in the following subsections.


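[Figure 4.4: an application inside the censored domain talks SOCKS5 to a MultiProxy node acting as client node / entrance node, which forwards traffic over TCP to a MultiProxy node acting as server node / exit node, which in turn reaches the target web server on the Internet over TCP/UDP.]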

Figure 4.4: Traffic routing path

Figure 4.4 shows the primary traffic routing path of the MultiProxy system. Two types of proxy nodes exist in the system: the client proxy node and the server proxy node. The client proxy can be the originator of requests, and it can also act as a relay node that forwards network traffic when circuits are built for onion routing. The server proxy node, also known as the exit node, is the critical component, since it acts not only as the server that receives data from the client proxy but also as the client that connects to the target web servers to provide the circumvention service. A client node needs to find at least one exit node to build a circuit and access target web servers.

The essential components of the system are listed in Table 4.2. As a circumvention system, its basic functionality is traffic forwarding, so the client node must forward its traffic to nodes that provide the circumvention service. This requires a data forwarding protocol between the applications and the MultiProxy system on the same host.

In order to transfer data between the nodes, messages should be tunneled and encrypted to evade the various network traffic analysis techniques of the GFW. The second functionality is designed to make better use of multiple nodes: the number of service providers and consumers should be balanced to make the system available and robust. MultiProxy also uses onion routing to provide anonymous communication over the network and to protect the identity and privacy of request originators. The messages are encapsulated at each node they pass through in such a way that the intermediate nodes cannot know the source and destination of the messages. This mechanism requires a cryptographic key exchange protocol along the circuit (the path between multiple nodes).
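The layered encapsulation can be illustrated with a toy example. The sketch below is a generic picture of onion layering and makes no claim about MultiProxy's actual message format or key exchange: the hop names, the keys, and the helpers `keystream_xor`, `wrap`, and `peel` are all hypothetical, and the cipher is a deliberately simple stand-in built from SHA-256 that must not be used in practice.

```python
import hashlib
from typing import List, Tuple


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 in counter mode). For illustration only:
    it has no nonce and no authentication."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


def wrap(message: bytes, hops: List[Tuple[str, bytes]], destination: str) -> bytes:
    """Build the onion. The innermost layer names the destination, each
    outer layer names the next hop. `hops` lists (name, key) pairs from the
    entry node to the exit node; hop names are assumed shorter than 256 bytes."""
    next_hop, onion = destination, message
    for name, key in reversed(hops):
        header = next_hop.encode()
        onion = keystream_xor(key, bytes([len(header)]) + header + onion)
        next_hop = name
    return onion


def peel(onion: bytes, key: bytes) -> Tuple[str, bytes]:
    """Remove one layer: a node learns only the next hop and an opaque blob."""
    plain = keystream_xor(key, onion)
    n = plain[0]
    return plain[1:1 + n].decode(), plain[1 + n:]


hops = [("entry", b"k1"), ("relay", b"k2"), ("exit", b"k3")]
onion = wrap(b"GET / HTTP/1.1", hops, "example.org:80")
for name, key in hops:
    next_hop, onion = peel(onion, key)
    print(name, "->", next_hop)
```

Only the exit node sees the destination and the original message, and only the entry node sees the originator's address, which is the property the paragraph above relies on.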

4.3 Traffic forwarding

4.3.1 SOCKS on edges

The basic functionality of MultiProxy is traffic forwarding. This is achieved by adding a proxy layer on top of the traditional client-server architecture, as shown in Figure 4.4. The proxy layer contains two primary components: a client node and a server node. The client node receives data from the original applications, e.g., browsers. It can be considered an extension of the applications, since it can modify or encrypt data from the source applications, or change the destination address of a receiver. In order to forward the application data to MultiProxy, a suitable data transmission protocol is needed.

Functionality        Goal                                        Requirements
Traffic forwarding   Provide basic circumvention service         SOCKS5 protocol; data encryption
Token economy        Make the system robust                      Trustchain protocol; Intel SGX, SCONE
Multi-hop messaging  Privacy protection for request originators  Circuits creation; data tunnelling

Table 4.2: MultiProxy system components

SOCKS5 is employed as the data transmission protocol between applications and the client node. SOCKS is a message exchange protocol that operates between the application layer and the transport layer. It is designed to traverse a firewall transparently and securely. Its latest version, version 5, is defined in RFC 1928 [26] and supports authentication, IPv6, and transferring data over both TCP and UDP. SOCKS5 contains three phases: negotiation, address transmission, and data transmission, as shown in Figure 4.5. When the protocol is initiated, the SOCKS5 client and the SOCKS5 server exchange the methods that will be used in the negotiation phase; the client then sends the destination address to the server. The server subsequently notifies the client that the connection is successful by replying with a binding message. If the connection is built successfully, the client and the server start to transfer data in the third phase.

MultiProxy adopts the SOCKS5 protocol between the applications and the client proxy because of its flexibility and extensibility: SOCKS5 not only supports transport layer protocols such as TCP and UDP but can also forward multiple application layer protocols such as HTTP and HTTPS.

In the MultiProxy system design, the server node is located in the uncensored domain and waits for data arriving from the client node through the secure channel, which it forwards to the free Internet. Its purpose is to decrypt, reconstruct, or modify the data before relaying it to the specified destination address, such as a web server that is blocked by the censorship system. Once the server node receives data from the remote website, it forwards the traffic back to the client node.


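[Figure 4.5: sequence diagram between SOCKS5 client, SOCKS5 server, and target server. 1. Negotiation: the client sends VER, NMETHOD, METHODS and the server answers with VER, METHODS. 2. Address transmission: the request carries VER, ..., ATYP, DST.ADDR, DST.PORT and the response carries VER, ..., ATYP, BND.ADDR, BND.PORT. 3. Data transmission.]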

Figure 4.5: Work flow of the SOCKS5 protocol
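The first two phases can be made concrete by building the messages byte by byte. The sketch below follows the field layout of RFC 1928 but is not taken from the MultiProxy implementation: the function names are hypothetical, only the "no authentication" method, the CONNECT command, and domain-name addressing are covered, and the reply parser handles the IPv4 form only.

```python
import struct


def greeting(methods=(0x00,)) -> bytes:
    """Phase 1 (client to server): VER, NMETHODS, METHODS.
    Method 0x00 means 'no authentication required'."""
    return bytes([0x05, len(methods), *methods])


def connect_request(host: str, port: int) -> bytes:
    """Phase 2 (client to server): VER, CMD, RSV, ATYP, DST.ADDR, DST.PORT.
    Uses CMD 0x01 (CONNECT) and ATYP 0x03 (domain name)."""
    name = host.encode("ascii")
    if len(name) > 255:
        raise ValueError("domain name too long for SOCKS5")
    return (bytes([0x05, 0x01, 0x00, 0x03, len(name)]) + name
            + struct.pack("!H", port))


def parse_reply(reply: bytes):
    """Phase 2 (server to client): VER, REP, RSV, ATYP, BND.ADDR, BND.PORT.
    Only the IPv4 form (ATYP 0x01) is handled in this sketch."""
    ver, rep, _rsv, atyp = reply[:4]
    if ver != 0x05 or atyp != 0x01:
        raise ValueError("unsupported reply")
    addr = ".".join(str(b) for b in reply[4:8])
    (port,) = struct.unpack("!H", reply[8:10])
    return rep == 0x00, addr, port


print(greeting().hex())
print(connect_request("mail-archive.com", 80).hex())
print(parse_reply(bytes([5, 0, 0, 1, 127, 0, 0, 1, 0x04, 0x38])))
```

Sending the destination as a domain name rather than an IP address lets the proxy side perform the DNS lookup, which keeps the query away from a poisoned local resolver.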

4.3.2 Peer-to-peer system

The proxy layer works appropriately only if the GFW does not know the address of the server node, since otherwise the server can be blacklisted. To make the system more scalable, the client node can use multiple server nodes to forward its traffic. In this manner, even if some servers are taken down by the GFW, which can happen in a traditional Shadowsocks setup, the client node still has other choices. Since Shadowsocks proxies are typically located in public clouds, the censors cannot block the entire domains, as doing so would lead to tremendous collateral damage. By combining all the SOCKS proxies into a single, shared infrastructure, the system becomes more resilient against the blocking of individual servers than a unilateral setup. In addition, multiple server nodes can improve bandwidth. For these reasons, MultiProxy is built on top of a peer-to-peer network.

Unlike traditional servers, the nodes in a peer-to-peer network do not have fixed and known IP addresses. Similar to human relationships in the real world, a newcomer joins the network through the introduction of a member who is already part of it: to join, a peer must connect to at least one node that is already in the network. In practice, the status of nodes changes dynamically, so a newly joining node cannot immediately find a peer. For this reason, some bootstrapping nodes, also known as rendezvous hosts, are set up to provide the initial peer-to-peer network configuration such as IP addresses and service identifiers. As shown in Figure 4.6, peers which do not know any other peers in the network first connect to the trustworthy bootstrap node and register their addresses as well as their service identifiers, which specify what kind of service they provide. The bootstrap node sends its configurations back to both peers. Nodes are allowed to discover and communicate with each other only if they have the same service identifier. After the introductions, each node in the network maintains its own list of known peers. Nodes send ping and pong messages periodically to keep track of the status of other nodes, e.g., to add new nodes or remove off-line nodes.

[Figure 4.6: nodes exchange introduction requests and responses with the bootstrapping node; nodes are grouped by service identifier (Service Id 1, Service Id 2).]

Figure 4.6: Bootstrapping of a peer-to-peer system
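The bootstrapping and peer maintenance steps described above can be sketched as follows. This is a simplified illustration and not the IPv8 implementation: the class and method names are hypothetical, the bootstrap node only returns known peers to the newly registering node, and the ping/pong exchange is reduced to recording the time of the last pong.

```python
from collections import defaultdict


class BootstrapNode:
    """Hypothetical rendezvous host that introduces peers per service id."""

    def __init__(self):
        self._peers = defaultdict(set)  # service_id -> set of addresses

    def register(self, address, service_id):
        """Register a peer and return already known peers of the same service."""
        known = sorted(self._peers[service_id] - {address})
        self._peers[service_id].add(address)
        return known


class PeerList:
    """Per-node list of known peers, pruned when pongs stop arriving."""

    def __init__(self, timeout):
        self.timeout = timeout
        self._last_seen = {}  # address -> time of the last pong

    def on_pong(self, address, now):
        self._last_seen[address] = now

    def prune(self, now):
        """Remove peers whose last pong is older than the timeout."""
        dead = [a for a, t in self._last_seen.items() if now - t > self.timeout]
        for address in dead:
            del self._last_seen[address]
        return dead

    def alive(self):
        return sorted(self._last_seen)
```

Peers registered under a different service identifier are never returned by `register`, which mirrors the rule that only nodes with the same service identifier may discover each other.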

4.4 Token economy

An unregulated system can only work in practice or reach a stable state if the interests of the parties who provide services and of those who consume services are balanced. In this case, MultiProxy only works reliably if people in the uncensored domain are willing to share their resources and contribute them to a common circumvention effort. In MultiProxy, the primary challenge is to balance the number of servers and clients.

Nowadays many people already run their own Shadowsocks servers in public clouds. MultiProxy therefore needs a mechanism that incentivizes people to contribute their server instances to the system when they want to use it, in order to avoid the problem of free-riders[1]. The free-rider problem is a significant threat for peer-to-peer systems in general and has been extensively studied for systems like BitTorrent, in which some peers only download files without uploading anything. This is unfair to peers who contribute to the network and can have negative effects on the system. The solution for optimizing the download speed in BitTorrent is to use tit-for-tat [7], which originates from a highly effective strategy in game theory. With this strategy, BitTorrent chokes uncooperative peers who do not contribute by uploading files and allocates its limited upload slots to other, more cooperative peers.

The free-rider problem in the MultiProxy system means that peers use the circumvention services provided by server nodes without contributing back by providing the bandwidth of their own proxy servers to other nodes. It can lead to an unbalanced number of proxy servers and clients, namely a number of clients far greater than the number of servers. To avoid this phenomenon, MultiProxy introduces tokens as a way to create incentives and to balance the provision of resources against consumption. More specifically, in the MultiProxy system, the client node must pay some tokens before using the circumvention service provided by a server node. Meanwhile, the server node can earn tokens by providing its circumvention service.

4.4.1 Analysis of threats to the system

In an open environment like the Internet, some nodes might behave selfishly and abuse the system for their own benefit. This section describes the threats and challenges that MultiProxy has to confront. There are three main threats that the system will encounter: the single point of failure, token falsification, and invasion of data privacy. The first two threats can be summarised as where and how to store the tokens. Therefore, MultiProxy focuses on solving two problems of the token economy, namely how to protect tokens and how to protect the data and sensitive information transferred between nodes.

Single point of failure

One of the weaknesses of centralized transaction servers is the single point of failure (SPOF): the whole token economy system stops working if the centralized transaction servers are compromised or fail. A related weakness is the trust problem: the centralized transaction servers are able to tamper with or revise the records, and no one in the network can verify their correctness.

Forging tokens

Even with the help of a decentralized accounting system, some threats remain in the system. The first threat is that a malicious server node does not provide equivalent services after taking the tokens. In such cases, the server node earns tokens without providing an equivalent service.

For instance, a malicious server could be located inside the censored domain for the sole purpose of mining tokens. It does not provide a real circumvention service, since it lacks the capability of relaying packets to the uncensored Internet. This malicious behavior can be detected by challenge-response authentication, as shown in Figure 4.7. If a server node chooses to be a circumvention server, it must complete the challenge-response authentication, which is used to determine whether the server node fulfils its role correctly, in this case by sending back the content of some specified blocked websites.

[Figure 4.7: diagram with a client and a malicious server inside the censored domain; the labelled steps include "1. Challenge", "2. Challenge", and "4. Wrong Response".]

Figure 4.7: Challenge response mechanism
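A minimal sketch of such a challenge-response check is given below. The thesis does not specify how the client knows the expected content of a blocked website, so this sketch assumes, purely for illustration, that the challenger holds a SHA-256 digest of the expected content. The function name `run_challenge` and the `fetch_via_server` callable are hypothetical.

```python
import hashlib


def run_challenge(fetch_via_server, challenges):
    """Return True if the server answers every challenge correctly.

    fetch_via_server: callable(url) -> bytes, the content relayed by the
        server node under test (hypothetical interface).
    challenges: dict mapping a blocked URL to the expected SHA-256 hex digest
        of its content (an assumption of this sketch).
    """
    for url, expected_digest in challenges.items():
        try:
            content = fetch_via_server(url)
        except OSError:
            # A server inside the censored domain cannot reach the website,
            # e.g. because the connection is reset.
            return False
        if hashlib.sha256(content).hexdigest() != expected_digest:
            return False
    return True
```

A server inside the censored domain fails this check either because its connection to the blocked website is interrupted or because it returns forged content.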

A malicious server node can also cheat during mining in order to get more tokens. For example, in the time-based mining model, a malicious server can make its system clock run faster. In the bandwidth or network traffic model, it is easy for the server to forge the bandwidth or the number of packets it relays.

Data privacy

Another threat is that the server node can manipulate data that comes from the client node.

Figure 4.8 shows how a malicious server in the uncensored domain violates its role. First, because it knows almost all information about the client node, for example the IP address, port number, and messages, the server node can perform traffic analysis, so the client node loses its privacy. In addition, the server can launch a man-in-the-middle (MITM) attack in multiple ways, e.g., by refusing to forward or dropping messages, or by modifying messages when they are not encrypted.

[Figure 4.8: diagram with a client and a malicious server. Reactions of a malicious server: 1. Perform traffic analysis; 2. Drop or modify the message; 3. Send wrong response back; 4. Impersonate the client.]

Figure 4.8: A malicious server node


4.4.2 Solutions for threats

Trustchain solutions for single point of failure

Since MultiProxy is a distributed proxy system, it does not use centralized transaction servers because of the weaknesses of centralized systems, such as the single point of failure, the performance bottleneck, and the trust problem mentioned in section 4.1. The solution for these problems is to use the Trustchain protocol[31] as a distributed ledger that records transactions for nodes. This section gives a short introduction to the Trustchain protocol.

The Trustchain protocol is an open, distributed ledger which is designed to record transactions and build trust between the involved parties. Unlike a conventional blockchain, in the Trustchain protocol each node maintains its own personal blockchain, initialized by a genesis block. Figure 4.9 shows the structure of a personal blockchain. Once a node creates a new transaction inside a half block, it first signs its half block, then sends a signing request to the other node, and stores the result if the block is verified and signed by the other peer. Because the chain is an append-only ordered list, it is resistant to data modification. The attestation reports are stored in the Trustchain.

Trustchain can cope with attacks. For example, it can detect the Sybil attack thanks to its data structure, in which a block is valid only when it is signed by both parties. A group of malicious nodes can only form their own clusters, and clusters without trust from the outside can be easily distinguished.

Compared to the Blockchain protocol, the Trustchain protocol is a better fit for MultiProxy because it is not necessary to wait for all nodes in the network to reach a global consensus state.
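The two properties used here, the append-only hash chain and the requirement that both parties sign a block, can be illustrated with a small sketch. It is not the Trustchain implementation: all names are hypothetical, and an HMAC with a per-node secret stands in for the public-key signatures that a real deployment would use.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass

GENESIS_HASH = "0" * 64


def sign(secret: bytes, payload: bytes) -> str:
    """Stand-in for a public-key signature (assumption of this sketch)."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


@dataclass
class Block:
    seq: int
    prev_hash: str
    transaction: dict
    initiator_sig: str = ""
    counter_sig: str = ""

    def payload(self) -> bytes:
        return json.dumps([self.seq, self.prev_hash, self.transaction],
                          sort_keys=True).encode()

    def digest(self) -> str:
        return hashlib.sha256(self.payload()).hexdigest()


class PersonalChain:
    """Append-only chain kept by a single node."""

    def __init__(self, secret: bytes):
        self.secret = secret
        self.blocks = []

    def propose(self, transaction: dict) -> Block:
        """Create a half block that is signed by the initiator only."""
        prev = self.blocks[-1].digest() if self.blocks else GENESIS_HASH
        block = Block(len(self.blocks), prev, transaction)
        block.initiator_sig = sign(self.secret, block.payload())
        return block

    def append(self, block: Block) -> None:
        """Accept a block only when both parties have signed it."""
        if not block.initiator_sig or not block.counter_sig:
            raise ValueError("block must be signed by both parties")
        self.blocks.append(block)

    def is_intact(self) -> bool:
        """Check that every block still points at the hash of its predecessor."""
        prev = GENESIS_HASH
        for block in self.blocks:
            if block.prev_hash != prev:
                return False
            prev = block.digest()
        return True
```

Modifying the transaction of an earlier block changes its digest, so the hash stored in the following block no longer matches and `is_intact` reports the tampering.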

Mining models and transaction methods

The mining models of MultiProxy can be based on many metrics, such as time and throughput.

• Time: Earn or spend tokens according to the service running time.

• Throughput: Earn or spend tokens according to the throughput during a certain amount of time.

Following the analysis in section 4.4.1, the number of tokens is considered sensitive data in the MultiProxy system in order to prevent malicious use of tokens. MultiProxy nodes keep their balance and income in a database and do not exchange tokens directly with each other. Instead, they must complete Intel SGX attestation, which is introduced in the next section. The goal is to prevent nodes from modifying or misreporting the amount of service during the mining process.


Figure 4.9: Trustchain protocol[20]


There are two approaches for clients and servers to exchange tokens. In the first approach, the server proxy checks and verifies the correctness of the transaction record inside the half block. With this approach, the server knows the identities and actual behaviors of the clients. For instance, the server sets timers or byte counters once a client is connected. The drawbacks are evident: keeping connection records for clients increases the overhead of the server, especially when the number of connected clients reaches the connection limit. For the client, this approach does not preserve privacy, since the server knows its behavior. In the second approach, the client and the server each set their own timers or byte counters to record their own expense or income. In this case, the server does not record or track client activities, so it knows nothing about the client. This not only protects the privacy of the client but also reduces the server-side code complexity and performance overhead.

MultiProxy uses the second approach, in which each agent records the total amount of tokens it earns or spends inside its own blockchain. However, this is not enough to solve the trust problem between the client and the server, since nodes can lie about the number of tokens they have. For instance, a server can modify the source code and claim that it earned 100 euros even if it never forwarded any network traffic. Therefore, the system needs another mechanism to ensure that nodes cannot forge the amount of work.

Keeping the servers honest with Intel SGX remote attestation

In order to solve the trust problem between the client and the server, MultiProxy is executed under the protection of the Intel Software Guard Extensions (SGX), which means that the selected code runs inside a protected region without any modifications. This subsection gives a short introduction to how Intel SGX prevents the source code from being modified or disclosed by untrusted parties.

Intel Software Guard Extensions technology uses a set of special CPU instructions to protect sensitive data. It requires applications to be divided into two components: a trusted component and an untrusted component. Only the code inside the trusted component, which is also known as the enclave, can access the private data. The remaining part runs in the untrusted component. The code and data in the enclave are placed in a dedicated area called the Enclave Page Cache (EPC), and this area is encrypted by the Memory Encryption Engine (MEE). If an external process reads the main memory, it can only observe encrypted data.

With the protection of the enclave, the code within the same application can be trusted, but secrets are still vulnerable during transit. In order to verify whether other applications on the local machine or on remote machines run in an enclave, Intel SGX provides two types of attestation: local attestation and remote attestation. Local attestation is used when multiple enclaves on the same machine need to cooperate on the same task: they attest to each other and obtain a session key for securely transferring and copying data. Remote attestation sends the attestation to a third-party server and gets the attestation result back. The private data in an enclave are encrypted when they enter the untrusted component. This is also called data sealing. The encryption key can be bound to either the enclave identity or the sealing identity, depending on the specific application requirements. Figure 4.10 shows how the client establishes an authenticated channel by remote attestation.

Figure 4.10: Intel SGX Remote attestation[19]

In MultiProxy, the client node and the server node cooperate by forwarding requests and responses, but in order to prevent cheating, all nodes need verification that the token mining part runs properly. Therefore, the code of the mining model in MultiProxy runs inside the enclave as the trusted component. This part of the code cannot be modified by any program except the untrusted part which creates the enclave. This ensures that the node itself cannot forge the amount of work by modifying the source code or changing the results during program execution. In addition, the system requires a remote attestation infrastructure to provide a way of attesting the data in transit and building a secure channel between nodes.

One limitation of Intel SGX is its particular hardware requirement: it only runs on sixth or later generation Intel Core processors with Intel SGX-enabled BIOS support, which means that MultiProxy does not provide the trusted environment on some older devices.

SCONE: trusted execution of containers

SCONE[2] is a Linux container based platform which utilizes the Intel SGX technology. It lets applications run in a secure fashion, just as with plain Intel SGX. For instance, SCONE can prevent malicious nodes from reading the main memory. In addition, SCONE prevents adversaries who have root privileges from reading the main memory, and prevents access by the operating system, hypervisor, and cloud provider; this protection is provided by the underlying Intel SGX.

Figure 4.11: Intel SGX Remote attestation full work flow[19]

SCONE keeps the performance overhead low by supporting asynchronous calls and user-level threading. Its API can transparently encrypt and decrypt I/O data such as configuration files, environment variables, and command line arguments. It verifies that the correct code is running before passing any configuration to the application. This is ensured by local and remote attestation and the configuration service, but unlike with Intel SGX, the attestation of SCONE is entirely transparent to the application.

MultiProxy uses SCONE as the running environment for three reasons. First, both Intel SGX and SCONE provide remote attestation. SCONE has its own attestation server called the SCONE global Configuration and Attestation Service (CAS). This infrastructure can issue the same SSL certificate to all end nodes to ensure that the token mining code of MultiProxy on different machines runs under the protection of the enclave. The second consideration is implementation complexity. The Intel SGX SDK is based on the C/C++ programming languages. For instance, Figure 4.11 depicts the full work flow of the Intel SGX remote attestation, which involves numerous message transmissions. Considering that MultiProxy is written in Python, there are some incompatibilities, and additional effort would be needed to combine the two languages. SCONE supports popular mainstream languages such as Java, Rust, Go and Python. Therefore, switching to SCONE is relatively easy because the source code of the application does not need to be modified. Finally, SCONE can be used on top of containers such as Docker, and the standardization of Docker makes MultiProxy easy to build, deploy and test on various platforms. Therefore, SCONE is used for implementing the token economy in MultiProxy, and it can prevent malicious nodes from modifying the source code.

Solutions for token economy

By combining the Trustchain protocol with SCONE, a distributed trusted token economy without cheating can be built as follows. Assume the most basic situation of two nodes, a client node and a server node, which both run the program on top of Intel SGX. In the beginning, they perform remote attestation to ensure that both endpoints are running the same, legitimate code inside the enclaves, and they both get the certificate issued by the CAS (Configuration and Attestation Service). The certificate is stored inside every transaction on the blockchain and is sent to the peer at periodic intervals. The peer verifies and signs the transaction only if the certificate was issued by the global CAS.
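The signing rule in the last step can be sketched as a small predicate. How the certificate is validated against the CAS is not detailed here, so the sketch assumes, purely for illustration, that each node holds a set of SHA-256 fingerprints of certificates known to be issued by the CAS. The names `fingerprint`, `should_sign`, and `trusted_fingerprints` are hypothetical.

```python
import hashlib


def fingerprint(certificate: bytes) -> str:
    """SHA-256 fingerprint of a certificate given as raw bytes."""
    return hashlib.sha256(certificate).hexdigest()


def should_sign(transaction: dict, trusted_fingerprints: set) -> bool:
    """Counter-sign only if the transaction carries a certificate issued by the CAS.

    Assumption of this sketch: CAS-issued certificates are recognised by
    comparing their fingerprint with a locally pinned set.
    """
    certificate = transaction.get("certificate")
    if certificate is None:
        return False
    return fingerprint(certificate) in trusted_fingerprints
```

A transaction without a certificate, or with a certificate from an unknown issuer, is left unsigned and therefore never becomes a valid block.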


4.5 Multi-hop messaging

4.5.1 Solutions for data privacy

According to the threat analysis in section 4.4.1, the information about the request originator needs to be protected against malicious behavior of the exit node, e.g., traffic analysis. One way to protect privacy is to restrict the knowledge of nodes, i.e., no node knows the full information of the network. Multi-hop message routing is a countermeasure for privacy protection since it hides the identities of request originators.

End-to-end anonymous communication

Anonymous communication is used for privacy protection. It can be achieved by building data transmission circuits, in which nodes are arranged along different paths. Onion routing wraps and encrypts messages in successive layers as they travel through a network with multiple intermediate nodes. It protects the privacy of the originator well, since the messages are encrypted in transit and no intermediate node inside the circuit can tell the source and the final destination of a message; only the exit node can determine that it is the last hop. In the MultiProxy system, messages are protected by the onion routing mechanism. The request originator can specify how many hops it wants to build. It then starts to build a circuit through a list of nodes, using asymmetric key cryptography, e.g., the Diffie-Hellman key exchange algorithm, to negotiate a different session key with each node. The server nodes declare themselves as exit nodes and become the last hops, which forward the data to the target web server. Once the circuit is built, data is forwarded inside the circuit. Figure 4.12 shows a 2-hop circuit from the client node to the exit node (server node). In the ideal situation, the client node and the relay node are both in the censored domain, so that the exit node cannot distinguish which one is the actual request originator.

[Figure 4.12: the client node and the relay node are inside the censored domain; the exit node forwards to the target server. The original message is wrapped in a first and a second encryption layer.]

Figure 4.12: A 2-hop onion routing circuit
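The layering idea can be illustrated with a toy sketch. The cipher below is a simple XOR keystream derived from SHA-256; it is an assumption made only to keep the example self-contained, it is not secure, and it is not the cipher used by IPv8. The function names are hypothetical, and the session keys are assumed to have been negotiated beforehand.

```python
import hashlib


def _keystream(key: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from a session key (toy only)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; applying it twice restores the data."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def wrap(message: bytes, session_keys: list) -> bytes:
    """Originator: add one encryption layer per hop.

    session_keys is ordered from the first hop to the exit node, so the
    innermost layer belongs to the exit node.
    """
    for key in reversed(session_keys):
        message = toy_cipher(key, message)
    return message


def peel(onion: bytes, session_key: bytes) -> bytes:
    """Hop: remove the layer that belongs to this hop."""
    return toy_cipher(session_key, onion)
```

In a 2-hop circuit the relay node peels the outer layer and still sees only ciphertext; the exit node peels the last layer and obtains the original message.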


4.6 Implementation details

MultiProxy¹ is built on top of IPv8². IPv8 is a Python library that provides multiple communities to ensure authenticated communication and privacy. These communities include peer discovery, attestation, DHT, anonymization, and the Trustchain. MultiProxy reuses the anonymous messaging and Trustchain communities, as shown in Figure 4.13.

[Figure 4.13: UML class diagram with the classes Community, TunnelCommunity, TrustchainCommunity, MultiProxy, MultiProxyInitiator, MultiProxyForwarder, MultiProxyClient, and MultiProxyServer.]

Figure 4.13: The class UML diagram of MultiProxy

4.6.1 Traffic forwarding

HTTP and HTTPS are currently the common World Wide Web data communication protocols and are based on TCP, whereas the latest version of IPv8 only has UDP endpoint implementations. Therefore, a TCP endpoint interface is written to forward basic web traffic using Twisted, a Python event-driven networking engine. Unlike other client-server based implementations, MultiProxy splits the system into three parts: Initiator, Forwarder and Server. Figure 4.13 shows the inheritance relationship between the different components. Both Initiator and Forwarder inherit from MultiProxyClient because some code can be reused, e.g., the code for serving as the SOCKS5 server.

¹ https://github.com/nyannko/proxy
² https://github.com/Tribler/py-ipv8


The class implementation is illustrated in Figure 4.14. Each component has two factory classes, which act as the server and the client respectively to handle the bidirectional connections. MultiProxyInitiator acts as a SOCKS5 server to receive data sent from the applications, packs the data into the self-defined packet shown in Figure 4.15, and finally sends it to the selected node. MultiProxyForwarder only changes the header of the received data and forwards it to its selected node. MultiProxyServer unpacks the true target server address from the request sent by the initiator and builds the TCP connection with the target website. The client factory class of each component forwards the data back towards the browser.

[Figure 4.14: the browser connects via SOCKS5 to MultiProxyInitiator (SOCKS5Factory, ClientRemoteFactory), which connects via TCP to MultiProxyForwarder (ForwarderFactory, ForwarderRemoteFactory), which connects via TCP to MultiProxyServer (ServerFactory, ServerRemoteFactory), which connects via TCP to the target web server.]

Figure 4.14: Components and work flow of MultiProxy

TCP header (20 bytes) | Circuit ID (4 bytes) | Data length (4 bytes) | Data type (4 bytes) | Message body

Figure 4.15: Packet structure

Since TCP is a stream-oriented protocol, the data from the origin may be broken up into chunks of arbitrary size, and the receiver has no way to recognize or determine the original message boundaries. Therefore, the sender attaches the message length inside the header, and the receiver parses the TCP stream and extracts the message body.
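The length-prefixed framing described above can be sketched as follows. The three 4-byte header fields follow Figure 4.15; the network byte order, the interpretation of the data length as the length of the message body, and the names `pack_message` and `StreamParser` are assumptions of this sketch rather than details of the MultiProxy implementation.

```python
import struct

# Circuit ID, data length, data type: three unsigned 32-bit integers.
HEADER = struct.Struct("!III")


def pack_message(circuit_id: int, data_type: int, body: bytes) -> bytes:
    """Sender: prepend the header, including the length of the body."""
    return HEADER.pack(circuit_id, len(body), data_type) + body


class StreamParser:
    """Receiver: reassemble complete messages from arbitrary TCP chunks."""

    def __init__(self):
        self._buffer = bytearray()

    def feed(self, chunk: bytes) -> list:
        """Add a chunk and return all messages that are now complete."""
        self._buffer.extend(chunk)
        messages = []
        while len(self._buffer) >= HEADER.size:
            circuit_id, length, data_type = HEADER.unpack_from(self._buffer)
            end = HEADER.size + length
            if len(self._buffer) < end:
                break  # the body has not fully arrived yet
            body = bytes(self._buffer[HEADER.size:end])
            del self._buffer[:end]
            messages.append((circuit_id, data_type, body))
        return messages
```

The parser keeps incomplete data in its buffer, so the result is the same regardless of how the stream is split into chunks.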

4.6.2 Token economy

The two token mining models mentioned in section 4.4.2 are implemented in Algorithms 1 and 2. The function COUNTTIME creates the transaction by calculating the time intervals. The function COUNTBYTES calculates the sum of the byte lengths of all connections. Both functions pack the certificate issued by SCONE inside the transaction.


The function SENDSIGN (Algorithm 3) describes how the sign requester sends its unsigned half block after choosing a mining model to calculate the transaction. The function SHOULDSIGN is shown in Algorithm 4, and it contains the attestation and verification of the SCONE certificate. This verification could be implemented by trying to connect via the HTTPS protocol. Currently, the implementation of MultiProxy runs in debug mode without attestation, since the SCONE CAS is currently not available.

Algorithm 1: COUNTTIME

Input: identity, certificate, balance, debit, credit
Output: A transaction that contains the current balance

    initialization
    timeStamp ← current time
    balance ← balance + debit - credit
    transaction ← createTransaction(timeStamp, identity, balance, certificate)
    return transaction

Algorithm 2: COUNTBYTES

Input: A client factory that manages connections, identity, SCONE certificate
Output: A transaction that contains the number of bytes

    initialization
    timeStamp ← current time
    totalBytes ← 0
    for connection in clientFactory do
        totalBytes ← totalBytes + getByteSize(connection)
    transaction ← createTransaction(timeStamp, identity, totalBytes, certificate)
    return transaction
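Algorithms 1 and 2 translate almost directly into Python. The following is a sketch and not the code from the MultiProxy repository: the `Connection` class, the dictionary layout of a transaction, and the optional `now` parameter (added so that the functions are easy to test) are assumptions.

```python
import time
from dataclasses import dataclass


@dataclass
class Connection:
    """Hypothetical stand-in for a connection managed by the client factory."""
    bytes_transferred: int


def create_transaction(timestamp, identity, amount, certificate):
    """Bundle the fields of a transaction into a dictionary (assumed layout)."""
    return {"timestamp": timestamp, "identity": identity,
            "amount": amount, "certificate": certificate}


def count_time(identity, certificate, balance, debit, credit, now=None):
    """Algorithm 1: record the updated balance in a transaction."""
    timestamp = time.time() if now is None else now
    new_balance = balance + debit - credit
    return create_transaction(timestamp, identity, new_balance, certificate)


def count_bytes(connections, identity, certificate, now=None):
    """Algorithm 2: record the total number of bytes over all connections."""
    timestamp = time.time() if now is None else now
    total_bytes = sum(c.bytes_transferred for c in connections)
    return create_transaction(timestamp, identity, total_bytes, certificate)
```

Both functions attach the certificate to the transaction, so that the counter-signing peer can check it before signing.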

Algorithm 3: SENDSIGN

Input: The mining model function, a list of known peers
Output: The callback that shows success or failure

    initialization
    randomPeer ← randomly select a peer
    randomPeerPubKey ← the public key of randomPeer
    transaction ← miningModel()
    callBack ← signBlock(randomPeer, randomPeerPubKey, transaction)
    return callBack

Algorithm 4: SHOULDSIGN

Input: The unsigned block
Output: A boolean value that states whether the signer should sign the block

    initialization
    certificate ← unpack(transaction)
    result ← verify(certificate)
    return result

4.6.3 Anonymous messaging

This section introduces the circuit building process of the Tunnel Community. Figure 4.16 shows the basic idea behind the IPv8 circuit creation. Various messages are sent between nodes when they establish a circuit. The request originator first specifies how many hops it wants to build, then it requests the first hop with a “create” message. Once hop 1 receives the message, it generates the shared session key and sends a “created” message back to the originator. The originator then sends an “extend” message to hop 1, and hop 1 forwards a “create” message to hop 2. When hop 2 receives the message, it generates a new session key and sends a “created” message back to hop 1. Hop 1 then sends an “extended” message back to the initiator. The process is repeated until the circuit reaches the exit node. Each node in the circuit only knows the current hop count, which it decrements by one before sending the message to the next hop, so an intermediate hop cannot distinguish the source and the final destination of the messages, since all hops receive the same messages while the circuit is being built. After the circuit is built, the encrypted data is transferred securely between the nodes using the multiple session keys.

Figure 4.16: Build a 2-hop circuit[37]
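The message sequence for building a circuit can be written down as a short trace generator. This is only an illustration of the order of the “create”, “created”, “extend”, and “extended” messages described above, not the IPv8 implementation; the function name is hypothetical, and for circuits with more than two hops the “extend” and “extended” messages are shown as logical messages between the originator and the current last hop, although they would travel through the already established part of the circuit.

```python
def circuit_build_trace(hops):
    """Return (sender, receiver, message) tuples for building a circuit.

    hops: ordered list of hop names, e.g. ["hop 1", "hop 2"].
    """
    if not hops:
        return []
    origin = "originator"
    # The first hop is created directly by the originator.
    trace = [(origin, hops[0], "create"), (hops[0], origin, "created")]
    # Every further hop is added by extending the circuit at its last hop.
    for i in range(1, len(hops)):
        last, new = hops[i - 1], hops[i]
        trace.append((origin, last, "extend"))
        trace.append((last, new, "create"))
        trace.append((new, last, "created"))
        trace.append((last, origin, "extended"))
    return trace
```

For a 2-hop circuit the trace contains the six messages of Figure 4.16 in the order described in the text.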


Chapter 5

Evaluation

This chapter presents a comprehensive evaluation of MultiProxy. The goal of the evaluation is to examine how well MultiProxy can circumvent the Great Firewall and how its overall performance compares to other censorship circumvention systems. In the first step, the metrics of the evaluation framework are defined in section 5.1. The methodologies and the experimental steps are then developed based on the evaluation framework in section 5.2. The measurement results and their analysis are presented in section 5.3.

5.1 Evaluation framework

In order to measure the performance of MultiProxy, an evaluation framework is developed as shown in Table 5.1. The overall performance of MultiProxy is broken down into two dimensions: network performance and system performance.

5.1.1 Network performance

Network performance can be considered the most important part of the evaluation framework. It aims at measuring the service quality of MultiProxy.

Dimension             Metrics
Network performance   Latency, Throughput
System performance    CPU usage, Memory usage

Table 5.1: Evaluation framework


Latency

Network latency, which is also known as end-to-end latency, refers to the total time that a packet takes from the source to the destination. The network latency can be influenced by any element on the path between the start point and the end point, for example the LAN, the WAN, or the path from the ISP to the host. One intuitive way to measure latency is to compute the transmission time by subtracting the sending time at one node from the arrival time at another node. However, since it is a nontrivial problem to synchronize the clocks among different devices, the round-trip time (RTT) is usually used to measure the latency between two nodes. With round-trip latency, all time comparisons are made on the same node. Since the network latency changes frequently, one way to get an accurate latency is to repeat the measurement throughout the day at regular time intervals and to report the maximum, minimum, average, and standard deviation (jitter) of the RTT.
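The summary statistics mentioned above can be computed from a list of RTT samples as in the following sketch. The function name and the use of the sample standard deviation as the jitter value are assumptions; the measurement itself (e.g., how the RTT samples are collected) is outside the scope of the example.

```python
import statistics


def summarize_rtt(samples_ms):
    """Return min, max, average, and jitter of RTT samples in milliseconds.

    Jitter is computed as the sample standard deviation, which needs at
    least two samples.
    """
    if len(samples_ms) < 2:
        raise ValueError("at least two RTT samples are required")
    return {
        "min": min(samples_ms),
        "max": max(samples_ms),
        "avg": statistics.mean(samples_ms),
        "jitter": statistics.stdev(samples_ms),
    }
```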

Throughput

The network throughput is the amount of data successfully delivered over the communication channel per unit of time. It is given in bits per second (bps). Throughput can be evaluated by dividing the total amount of data that one node received by the RTT. However, the actual throughput is usually lower than the theoretical bandwidth of the medium due to various additional factors, such as the concurrent use of the network, congestion, and flow control.
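As a worked example of this definition, the following sketch converts a byte count and a measured duration into bits per second. The function name is hypothetical, and the duration is whatever time the measurement associates with the transfer (the RTT in the description above).

```python
def throughput_bps(total_bytes, elapsed_seconds):
    """Throughput in bits per second for total_bytes received in elapsed_seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return total_bytes * 8 / elapsed_seconds
```

For instance, receiving 1,000,000 bytes in 2 seconds corresponds to 4,000,000 bits per second, i.e., 4 Mbit/s.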

5.1.2 System performance

Although most computers nowadays have powerful CPUs and a large memory capacity, system performance is still an important factor for the user experience. CPU usage and memory usage are evaluated while MultiProxy is in use.

CPU usage

The CPU usage indicates the proportion of CPU cycles dedicated to running a particular program. The experiment measures the CPU utilization, which is the CPU time divided by the running time of the process, expressed as a percentage.
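This definition of CPU utilization can be illustrated with the Python standard library, where `time.process_time` returns the CPU time of the current process and `time.perf_counter` the elapsed wall-clock time. The helper below is a sketch for measuring a single function call within the current process; it is not the tool used in the experiments, and its name is hypothetical.

```python
import time


def cpu_utilization(func, *args, **kwargs):
    """Run func and return (result, CPU time / wall-clock time in percent)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    percent = 100.0 * cpu_used / wall_used if wall_used > 0 else 0.0
    return result, percent
```

A call that mostly waits for network I/O yields a value close to zero, while a compute-bound call approaches 100% on a single core.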

Memory usage

Memory usage refers to how much physical memory a program uses while it is running. The evaluation reports the physical memory that a process occupies instead of the ratio between the resident set size and the machine's physical memory, since the ratio is close to zero if the machine has a large amount of memory.
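A short worked example illustrates why the ratio is uninformative. It assumes a process of roughly 27 MB on a machine with 62 GB of memory (the figures reported later in this chapter) and 1 GB = 1024 MB; the function name is hypothetical.

```python
def rss_ratio_percent(rss_mb, machine_memory_gb):
    """Resident set size as a percentage of the machine's physical memory."""
    return 100.0 * rss_mb / (machine_memory_gb * 1024)


# About 27 MB of resident memory on a 62 GB machine.
ratio = rss_ratio_percent(27, 62)
```

The result is well below a tenth of a percent, so small changes in memory usage would be hard to read from the ratio.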

5.2 Methodologies and experimental steps

The evaluation framework in section 5.1 is applied to the methodologies and experiments. The evaluation of system functionality consists of two parts. In the first part, the performance of MultiProxy is measured under different server location setups. In the second part, the effectiveness of MultiProxy is measured by comparing its performance with other representative censorship circumvention systems to obtain its relative performance. The scalability test in the last subsection reports the performance with multiple concurrent nodes.

5.2.1 System performance

This section describes the detailed experimental steps for measuring pure system performance. The server's location is considered the main factor influencing MultiProxy. To obtain both ideal and practical performance, MultiProxy is deployed in two different situations. The first case is the over-provisioned case, which provides enough capacity for network traffic transmission. The network traffic is therefore expected to have low latency and relatively high throughput with a low packet loss rate, so the ideal performance of MultiProxy can be measured in this case. The second case has more uncertainty factors, e.g., how many resources or buffers the intermediate routers can hold.

The two deployment locations are as follows:

• All nodes are deployed inside the same cluster of DAS5.

• The nodes are deployed among different cloud instances on a global scale.

The Distributed ASCI Supercomputer 5 (DAS5) is a six-cluster wide-area distributed system that provides a common computational infrastructure for parallel and distributed tasks. The head node of each cluster manages tasks using the SLURM batch queueing system. Each node used in the experiments has two eight-core Intel Xeon E5 CPUs (2.40 GHz) with Intel Hyper-Threading technology, which allows each physical core to behave like two logical processors, so each node has 32 logical processors in total. The memory size is 62 GB. For the network configuration, MultiProxy makes use of the InfiniBand network, which offers up to 48 Gbit/s for data transfer. Since some user or registered ports are filtered by the firewall, MultiProxy instances running on different head nodes cannot connect to each other directly. The experiment with nodes located in different clusters therefore cannot be completed, and only the performance test with nodes located in the same cluster is performed. The operating system on DAS5 is CentOS 7.

For the global cloud instances, the client is an Alibaba Cloud instance located in South China. The candidate intermediate hops and the exit nodes are all Google Cloud instances, located in the Netherlands and in South Carolina, USA, respectively. These cloud instances use the high-performance premium networking tier, whose network topology is shown in Figure 5.1. The client in China first connects to one cloud instance located in the Netherlands; this instance acts as a client and forwards the data to the other instance in the Netherlands, from which the data finally reaches the exit node in the USA. Each global cloud instance employed for the test has one single-core Intel Xeon CPU and thus one logical processor. The operating system of these cloud instances is CentOS 7.

Figure 5.1: Premium networking topology of Google Cloud instances [21]

5.2.2 Performance Comparison

A comparative study is conducted to measure the practical performance of MultiProxy against other representative censorship circumvention systems. These systems are divided into three categories according to their protocols and architectures, and for each category the most popular and representative systems are chosen. The first category is the proxy solutions, which work at the transport layer, including Shadowsocks and V2Ray. The second is the Virtual Private Network (VPN) solutions, which work at the network layer, including OpenVPN and OpenConnect. The last is the distributed anonymity network solution, whose typical application is Tor. The overall performance is measured using the evaluation framework described in section 5.1.

The different CRSes are tested with different encryption and access methods. Shadowsocks is configured with the AES-256-CFB encryption method to protect the data transmitted between the proxy client and server, while V2Ray uses its self-defined communication protocol called VMess. For the VPN solutions, both OpenVPN and OpenConnect make use of AES-256-GCM encryption and the LZ4 data compression method based on Datagram Transport Layer Security (DTLS). Tor uses the meek pluggable transport to discover available bridges, as well as onion routing to transfer data. All CRS servers, including the proxy servers and VPN servers, are deployed in the US instance and act as the exit node, except for Tor, which uses its own onion routing settings. For MultiProxy, the one-hop setup consists of the China instance and the US instance. The two-hop setup consists of the China, Netherlands, and US instances, while the three-hop setup includes one instance in China, two instances in the Netherlands, and one instance in the US.

To measure the performance of the systems, the command-line tool curl acts as the client and records metrics such as the data transmission time.
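One way such a measurement could be scripted is sketched below. The exact curl invocation used in the experiments is not given in this thesis, so the options chosen here, the local SOCKS5 address 127.0.0.1:1080, and the helper names are assumptions for illustration. The sketch relies on curl's `--socks5-hostname` option and the `%{time_total}` and `%{size_download}` write-out variables.

```python
import subprocess


def build_curl_command(url, socks5_host="127.0.0.1", socks5_port=1080):
    """Build a curl call that fetches url through a SOCKS5 proxy (assumed
    address) and prints total time in seconds and downloaded bytes."""
    return [
        "curl", "--silent", "--output", "/dev/null",
        "--socks5-hostname", f"{socks5_host}:{socks5_port}",
        "--write-out", "%{time_total} %{size_download}",
        url,
    ]


def parse_curl_metrics(output):
    """Parse the '<seconds> <bytes>' line produced by --write-out."""
    time_total_s, size_bytes = output.split()
    return float(time_total_s), int(float(size_bytes))


def measure(url):
    """Run curl once and return (seconds, bytes); requires curl and a proxy."""
    result = subprocess.run(build_curl_command(url), capture_output=True,
                            text=True, check=True)
    return parse_curl_metrics(result.stdout)
```

Repeating `measure` for each target website and each circumvention system would give the samples from which the mean and standard deviation are computed.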

5.3 Results and analysis

This section presents the results and analysis of the evaluation. Subsection 5.3.1 shows the network performance of MultiProxy, including the latency and throughput comparison with other censorship circumvention systems. Subsection 5.3.2 gives the system performance, including the CPU and memory usage. Subsection 5.3.3 shows the result of the scalability test.

5.3.1 Network performance

The latency within MultiProxy is measured by calculating the message sending time between two arbitrary nodes. The result shows an RTT of 2.815 ms with a standard deviation of 0.5 ms. In the next step, to measure the real network performance of MultiProxy, the top five blocked websites from the Alexa Top 500 Global Sites are selected as the target servers because of their popularity: Google, Instagram, Twitter, Facebook, and YouTube.

The first experiment measures the latency within the same DAS5 cluster. Figure 5.2 compares the latency when clients visit the different target web servers. The point in the middle indicates the mean of the measurement, and the vertical line represents the standard deviation of the page download time. Hop zero means that the proxy client and server both run on the same host.

The result shows that the latency inside the same DAS5 cluster increases slightly as the number of hops grows. Google has the lowest response time, followed closely by Instagram, Twitter, and Facebook, while the latency of YouTube is the highest because the website transmits more data than the others. The latency and standard deviation increase due to the network situation and the proxy processing time. Although the latency grows slightly with the number of hops, in absolute values the difference between using multiple hops and connecting directly to the target servers without proxies is small and likely below what a user can perceive. The result shows that under an ideal environment, the latency of MultiProxy is close to that of a direct connection to the server, while the privacy of the user is protected much better.

[Figure: Latency with different hop length. x-axis: Hops (without proxy, 0, 1, 2, 3); y-axis: Latency (ms); series: Google, Instagram, Twitter, Facebook, YouTube. Data labels recovered from the plot, per series from left to right: 397, 418, 427, 432, 446; 501, 507, 514, 518, 521; 646, 649, 652, 665, 669; 1136, 1140, 1140, 1142, 1155.]

Figure 5.2: Latency with different hop length

In the global latency measurement, the target servers and access methods are the same as in the first experiment, and the result is shown in Figure 5.3a. The relative performance across websites is similar to the local latency test: Google responds fastest, while YouTube is slowest. The latency and standard deviation of Tor are high compared to the others, due to the limited number of domain fronting servers and bridges, or the unbalanced number of client nodes and exit nodes. Tor needs 10 to 35 seconds on average to access the homepages, while the latency of the other systems is under 4 seconds. Because of the high latency of Tor, the results of the other CRSes are unreadable in the same graph.

Figure 5.3b omits the Tor results and focuses on the other systems. The results indicate that the VPN solutions are slightly slower than the regular proxy solutions. One reason is that a VPN works at the network layer and encrypts and compresses the whole IP packet, whereas the proxy solutions work at the transport layer and can categorize different network traffic instead of dealing with the whole IP packet, e.g., they only encrypt the body of the TCP stream or UDP datagram. Another reason is that proxies are more flexible than VPN solutions. For example, they can build a blocklist and only encrypt or compress the network traffic when the destination address is blocked. The performance of one-hop MultiProxy, shown in third place from the left, lies between the proxy solutions and the VPN solutions. The latency increases with more hops due to the different network situations, and the one-hop setup is more stable than the two-hop and three-hop setups. MultiProxy still obtains quick access compared to Tor. One reason mentioned before is that Tor has a small number of domain fronting servers and exit nodes compared to its users or client nodes. The token economy can address this in MultiProxy, since all exit nodes get paid for providing the circumvention service. The economic system can increase the number of service providers, which provides a better network environment and better latency and throughput.

[Figure: Latency under different methods. x-axis: Shadowsocks, V2Ray, MultiProxy, OpenVPN, OpenConnect, and (panel a only) Tor; y-axis: Latency (ms), 0 to 35000 in panel (a) and 0 to 4500 in panel (b); series: Google, Instagram, Twitter, Facebook, YouTube.]

(a) Latency measurement

(b) Latency measurement without Tor

Figure 5.3: Latency measurement

Besides the latency tests, a throughput test is performed to measure the download speed. The client downloads a 10 MB speed test file located on the same instance as the server side of the CRSes. The result is shown in Figure 5.4, which compares the throughput of the different systems. Shadowsocks has the fastest download speed at 2745 KB per second. Tor has the lowest at 17 KB per second, which is consistent with the latency test, where Tor has the highest latency and response time. Both OpenVPN and OpenConnect are slower than the proxies. The one-hop setup of MultiProxy obtains a throughput similar to that of Shadowsocks and V2Ray. The throughput of MultiProxy with two and three hops decreases due to the more complex network situations.

[Figure: Throughput Comparison. x-axis: Throughput (KB/s), 0 to 3000. Bars and data labels in extraction order: Shadowsocks 2745, V2Ray 2508, MultiProxy1 2496, MultiProxy2 1776, MultiProxy3 1589, OpenConnect 2038, OpenVPN 2220, Tor 27.]

Figure 5.4: Throughput measurement

5.3.2 System performance

This section shows the system performance results, including CPU usage and memory usage. The experiments are conducted on DAS5. The CPU performance of the client, forwarder, and server is measured separately. In the first case, MultiProxy establishes a 2-hop circuit consisting of one client, one forwarder, and one server, without starting to provide the circumvention service. Figure 5.5 presents the variation in CPU utilization. Figure 5.5a shows the CPU usage with a y-axis ranging from 0 to 100 percent, which shows that the CPU consumption during the whole MultiProxy process is no more than 6%. Figure 5.5b depicts the details with a smaller y-axis range. The value is around 4.5% at the earliest stage because of the initialization workload, and falls steadily to 0.5% over time. In the second case, curl is employed as the request originator. The starting point of the CPU usage is around 5%, slightly higher than in the first case, and the overhead increases for all nodes, especially the server and client, whose usage is around 3% to 4%, while the usage of the forwarder stabilizes at 0.8%.

The memory test measures the mean value and standard deviation over a period of time before and after MultiProxy provides the circumvention service. The memory here refers to physical memory. As shown in Figure 5.6, the memory usage is stable at around 27 MB without providing the circumvention service. In the second case, the usage of the client and server increases slightly to 29 MB, while the forwarder remains at the same level as before. This is because the forwarder has a lighter workload than the client and server.

[Figure: CPU usage of each component in MultiProxy. x-axis: Time (s), 0 to 105; y-axis: CPU usage (%); series: Client, Forwarder, Server, Client (curl), Forwarder (curl), Server (curl).]

(a) CPU usage on a 0 to 100% y-axis

(b) CPU usage on a 0 to 6% y-axis

Figure 5.5: CPU usage (%)

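[Figure: Memory usage of each component in MultiProxy. x-axis: Before circumvention, After circumvention; y-axis: Memory usage (MB), 0 to 30; series: Client, Forwarder, Server. Data labels recovered from the plot: 27.49 for all three components before circumvention; 27.49 for the forwarder and 29.69 and 29.55 for the other two components after circumvention.]

Figure 5.6: Memory usage (MB)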

5.3.3 Scalability test

This section describes the scalability test, whose goal is to measure the capacity of the system to scale up or scale out. It evaluates particular metrics as the workload increases. The latency and throughput from the network performance framework are chosen as the metrics for evaluating how many nodes the network can hold without notable performance degradation. A network with a one-to-one client-server ratio is measured. In this network, every client builds a one-hop circuit by randomly choosing a server as the exit node and then sends a request to Google.com.

The scalability experiment is run on DAS5. Up to 50 MultiProxy nodes are hosted on each physical node in the cluster. This is feasible since, as mentioned in section 5.2, each physical node has 32 logical processors, and MultiProxy performs many network I/O operations, so the CPU can use the time one thread spends waiting for I/O to run other threads. Each logical processor can therefore serve more than one thread. The experiment measures up to 500 nodes in total because not all machines in the DAS5 clusters are available at the same time.
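The effect can be illustrated with a small Python sketch in which 50 threads each wait on a simulated I/O operation. The sleep call stands in for waiting on a socket, and the task count and delay are arbitrary illustration values rather than the parameters of the experiment.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_request(delay_s):
    """Stand-in for a network request: the thread only waits."""
    time.sleep(delay_s)
    return delay_s


def run_concurrently(n_tasks, delay_s):
    """Run n_tasks waiting tasks in parallel threads; return (seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        results = list(pool.map(fake_io_request, [delay_s] * n_tasks))
    return time.perf_counter() - start, results


elapsed, results = run_concurrently(50, 0.05)
```

Because the threads spend their time waiting rather than computing, the total elapsed time stays far below the 2.5 s that running the 50 waits one after another would take.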

Figure 5.7 shows the latency and throughput of the network. The latency is around 0.52 s when the number of nodes is under 150 and increases slightly between 150 and 300 nodes. From 300 to 500 nodes, the latency increases to 0.68 s and the corresponding throughput decreases to 0.5 KB/s.

[Figure: two panels with x-axis "The number of Nodes" (1 to 500). Panel (a): y-axis Latency (s), 0.0 to 1.0. Panel (b): y-axis Throughput (KB/s), 0.0 to 1.0.]

(a) Latency with different node number

(b) Throughput with different node number

Figure 5.7: Scalability test

Chapter 6

Conclusions and Future Work

6.1 Conclusions

This chapter gives the conclusions and recommendations of the research on the GFW. Section 6.1.1 answers the research questions, section 6.1.2 highlights the main contributions of this thesis, and section 6.2 describes future work.

6.1.1 Results for each research question

RQ1: What are the current content blocking techniques?

There are four categories of Internet censorship: client-side censorship, server-side censorship, in-path censorship, and on-path censorship. This thesis selects one of the best-known country-wide censorship and surveillance systems, the Great Firewall of China, as the case study, so it mainly focuses on in-path and on-path censorship. The GFW uses multiple modules to check and filter undesired network traffic.

The GFW has multiple levels of blocking. The blocking does not act at the physical layer, since that is the fundamental layer underlying the higher layers and provides the physical connection. In-path censorship works at the network layer, which provides the means of transferring network packets from the source to the destination. IP blocking is the primary means of intercepting the traffic. It compares the destination address in the packet with the access control list inside the international gateways, and by using the BGP protocol, the undesired network traffic is redirected to random addresses or discarded. IP blocking is a simple and direct way to block network traffic because the router participates directly in the routing process.
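The access control list comparison can be sketched with Python's standard `ipaddress` module. The list contents are hypothetical and use address ranges reserved for documentation; the real list held by the gateways is not public.

```python
import ipaddress


def is_blocked(dst_ip, acl):
    """Return True if the destination address falls inside any ACL network."""
    addr = ipaddress.ip_address(dst_ip)
    return any(addr in net for net in acl)


# Hypothetical ACL built from documentation-only address ranges.
acl = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]
```

A service that owns numerous consecutive addresses can be covered by a few such network entries, which is what makes IP blocking cheap for the router.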

On-path censorship, in contrast, is not directly involved in the routing path between two endpoints. It makes a copy of all the network traffic, monitors it beside the gateways, and performs man-in-the-middle attacks. It works at the transport layer, which provides host-to-host communication for applications. The main protocols in the transport layer are TCP and UDP, and the GFW uses keyword detection on the higher application layer, for instance HTTP and DNS keyword detection, to decide whether the traffic should be blocked. TCP connection reset is used in the decision phase, that is, the GFW interrupts a connection by sending RST packets to both endpoints after the TCP three-way handshake. This kind of man-in-the-middle attack is stateful: the GFW remembers the state of a connection for 90 to 95 seconds. If the client tries to reestablish the connection to the server during that time, the handshake requests are immediately detected and blocked by the GFW, and the server side does not even learn of the client's attempt. In this way, blocking becomes more efficient, since the GFW only needs to send RST packets to the client side.
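The stateful behavior can be modeled with a small sketch. This is a simplified illustration and not the GFW's actual implementation: the class name, the keyword list, and the fixed 90-second window (the lower bound of the range given above) are assumptions.

```python
BLOCK_SECONDS = 90  # assumed: lower bound of the 90 to 95 s window


class ResetState:
    """Toy model of keyword-triggered, stateful connection resets."""

    def __init__(self, keywords, block_seconds=BLOCK_SECONDS):
        self.keywords = [k.lower() for k in keywords]
        self.block_seconds = block_seconds
        self.blocked_until = {}  # (client, server) -> expiry timestamp

    def inspect(self, client, server, payload, now):
        """Return True if a reset would be injected for this traffic."""
        pair = (client, server)
        expiry = self.blocked_until.get(pair)
        if expiry is not None:
            if now < expiry:
                return True  # pair is still remembered: reset at once
            del self.blocked_until[pair]
        if any(k in payload.lower() for k in self.keywords):
            self.blocked_until[pair] = now + self.block_seconds
            return True
        return False


state = ResetState(["forbidden"])
```

After one keyword match, any traffic between the same pair is reset until the window expires, even if the later payload is harmless.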

The GFW mainly adopts two methods for DNS over UDP: DNS hijacking and DNS poisoning, where DNS poisoning is a side effect of the former. DNS hijacking happens when a client sends requests to DNS servers in the uncensored domain. The GFW wiretaps the path and constructs a bogus DNS response packet containing a wrong resolution answer. Since the GFW is fewer routing hops away from the client than the DNS server, the bogus response arrives first and is accepted by the client, and the correct answer that arrives later is dropped by the client-side TCP/IP stack. DNS poisoning has a broader negative impact than DNS hijacking. It happens when a domestic DNS server sends requests recursively, and the GFW performs DNS hijacking and injects the wrong answers into the server's cache. These wrong answers can stay there for several minutes or days. For DNS over TCP, the GFW uses TCP connection reset.

The experiments in chapter 3 show the proportions of the different blocking techniques. Although TCP connection reset has the most widespread impact, the most popular websites such as Google, Instagram, Twitter, Facebook, and YouTube are IP-blocked by the GFW, since they have numerous and consecutive ranges of IP addresses. For DNS hijacking, the wrong answers always point to random addresses of service providers such as Facebook and SoftLayer.
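The race between the forged and the genuine response can be illustrated with a toy model in which the client keeps whichever answer arrives first. The arrival times, source labels, and addresses are hypothetical illustration values.

```python
def accepted_response(responses):
    """Return the response the client accepts: the earliest arrival.

    Each response is a tuple (arrival_ms, source, address).
    """
    return min(responses, key=lambda r: r[0])


# Hypothetical race: the injector sits closer to the client than the resolver.
responses = [
    (180, "resolver", "192.0.2.10"),     # genuine answer, arrives later
    (20, "injector", "203.0.113.99"),    # forged answer, arrives first
]
winner = accepted_response(responses)
```

The genuine answer loses only because it arrives later, which is why the shorter path of the injector is decisive.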

RQ2: What are the current anti-censorship methods?

The most prevalent censorship circumvention systems use a client-server architecture, including the VPN solutions and the proxy solutions.

A VPN works at the network layer. It establishes end-to-end virtual connections through dedicated circuits, and the IP packets are encrypted along the whole connection.

A proxy works at the transport layer. It is weaker than a VPN in terms of security but more flexible, since it only processes the streams or datagrams in the higher layer and does not need to encrypt the whole IP packet. It can distinguish different kinds of network traffic. By creating a blacklist or whitelist, it can quickly decide which kind of network traffic should be encrypted.
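A list-based routing decision of this kind can be sketched as follows. The function name and the list contents are hypothetical; a real proxy client may match on addresses or patterns instead of domain suffixes.

```python
def route(hostname, blocklist):
    """Return 'proxy' if hostname or a parent domain is listed, else 'direct'."""
    host = hostname.lower().rstrip(".")
    for blocked in blocklist:
        if host == blocked or host.endswith("." + blocked):
            return "proxy"
    return "direct"


# Hypothetical list of blocked domains.
blocklist = {"google.com", "youtube.com"}
```

Only traffic routed to `proxy` needs to be encrypted and tunneled, while all other traffic goes out directly without extra processing.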

Both the VPN and the proxy take measures to protect messages in transit. A VPN focuses more on encryption, such as the exchange of authentication messages or certificates. Different VPN protocols have different implementations, and they can leak specific network flow features during transmission. Although encryption can easily resist eavesdroppers in the middle, this does not work against the GFW, since the GFW aims at blocking the network traffic rather than decrypting it. This is the reason why some VPN protocols are unstable or blocked by the GFW.

A proxy uses obfuscation methods to evade the monitoring of the GFW, so that its network traffic looks more like a normal network flow between two endpoints. A proxy generally does not encrypt the headers of the network layer and transport layer protocols, but only tunnels the obfuscated data inside the protocol message body, using methods such as different algorithms to encrypt or randomize the tunneled messages. Some unusual protocols, such as SSH, can still be distinguished. As a result, most proxies use the regular HTTP or HTTPS protocols to tunnel the original requests.

These client-server solutions are easy to deploy and use, and usually achieve low latency in personal use. Their drawback is obvious: once the server is observed by the GFW, the cloud instance is immediately blocked. This causes losses for the user, such as the service becoming unavailable and the need to deploy the proxy server again from the beginning.

Other applications use a distributed anonymous network, such as Tor. The clients employ pluggable transports such as meek. This network provides better privacy protection, since each node in the network only holds a portion of the identity information, and an intermediate node only knows the addresses of its previous and next nodes. Although the last hop decapsulates the messages, it does not know the request owner.
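The limited view of each node can be made concrete with a small model of a circuit. The node names are hypothetical, and the sketch only models which neighbors a node can see; it does not implement the layered encryption used by onion routing.

```python
def neighbor_view(path):
    """Map each node on the path to the only addresses it learns."""
    views = {}
    for i, node in enumerate(path):
        views[node] = {
            "previous": path[i - 1] if i > 0 else None,
            "next": path[i + 1] if i < len(path) - 1 else None,
        }
    return views


# Hypothetical circuit from the request owner to the exit node.
path = ["client", "hop1", "hop2", "exit"]
views = neighbor_view(path)
```

In this model the exit node sees only `hop2` as its predecessor, so it cannot link the decapsulated request to `client`.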

The distributed network utilizes more resources than the standard client-server architecture. Volunteer-operated servers form the distributed network. Once the GFW blocks a circuit, the application can switch to another circuit without redeployment, which avoids the single point of failure of the proxy server.

Another advantage is that Tor is free and easy to use. Users can access the Tor network by downloading the browser and using a pluggable transport in the censored domain.

The drawback is obvious, especially in the censored domain: since the pluggable transport has a limited number of network traffic forwarders, switching circuits causes a long waiting time.

RQ3: How can an effective way for evading censorship be developed?

The design goal of MultiProxy is to combine the advantages of the client-server and distributed architectures and to overcome the drawbacks of the client-server architecture, such as the single point of failure and the lack of resources.

First of all, since most popular websites are IP-blocked by the GFW, the system needs a basic traffic forwarding module to provide the basic circumvention service. The architecture is similar to the client-server CRSes. It contains a client-side proxy that receives and processes messages from the originator applications and forwards these messages to the remote-side proxy. As the traffic forwarding protocol between the applications and the client proxy, MultiProxy adopts SOCKS version 5, because it is more flexible and scalable than other protocols such as HTTP/HTTPS proxying.
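For illustration, the first two messages that an application sends to a SOCKS5 proxy can be built as follows. The byte layout follows the SOCKS version 5 specification [26]; the function names are hypothetical and the sketch is not taken from the MultiProxy code base.

```python
import struct


def socks5_greeting():
    """Client greeting: VER=5, NMETHODS=1, METHOD=0x00 (no authentication)."""
    return bytes([0x05, 0x01, 0x00])


def socks5_connect_request(hostname, port):
    """CONNECT request with a domain-name address (ATYP=0x03)."""
    host = hostname.encode("ascii")
    if not 1 <= len(host) <= 255:
        raise ValueError("hostname must be 1 to 255 bytes long")
    if not 0 < port < 65536:
        raise ValueError("port out of range")
    header = bytes([0x05, 0x01, 0x00, 0x03, len(host)])  # VER, CMD, RSV, ATYP, LEN
    return header + host + struct.pack("!H", port)       # port in network order


request = socks5_connect_request("example.com", 443)
```

Sending the domain name instead of a resolved address lets the proxy perform the DNS lookup on the uncensored side, which avoids the DNS hijacking described under RQ1.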

Since MultiProxy also provides privacy for request originators, it can add multiple layers (forwarders or hops) to the path between the client proxy and the server proxy. To ensure a balanced number of nodes, MultiProxy has a token economy, which uses the TrustChain protocol as a distributed ledger, and Intel SGX prevents nodes from cheating.

RQ4: How can the performance and effectiveness of censorship evasion systems be evaluated?

The evaluation framework is described in chapter 5. It first evaluates the network and system performance of MultiProxy on DAS5, which is the ideal environment; the result can be considered the best performance MultiProxy can achieve. Next, a comparative study of circumvention systems is conducted using cloud instances in a practical production environment. The results show that MultiProxy is able to evade the GFW, and they also show the performance gaps among different CRSes. Finally, a scalability test shows to what extent MultiProxy can scale up.

RQ5: What are the lessons and recommendations for circumvention?

The GFW aims at sorting out suspicious circumvention network traffic instead of decrypting the messages in between. Therefore, some new or self-invented protocols work for circumvention even if they are not encrypted, as long as the GFW does not notice them. However, obfuscation is a better solution, since hiding or tunneling the circumvention network traffic inside standard protocols that are hard to block, for example through domain fronting, makes regulation more difficult.

Concerning the network structure, since MultiProxy is built on a peer-to-peer network, some methods should be employed to keep the network environment stable, in this case maintaining the balance between the client and server nodes.

6.1.2 Main contributions

This thesis aims at overcoming the shortcomings of both client-server and distributed circumvention systems, i.e., the single point of failure of the server proxy as well as the free-rider problem that exists in an unregulated network. The main contribution of this thesis is the proposal, implementation, and evaluation of a circumvention system based on a peer-to-peer network, with a token economy to balance the number of nodes inside and outside the censored domain, and multi-hop messaging to protect the privacy of users.

6.2 Future Work

There are still some issues that need to be addressed. First, the system needs multiple traffic obfuscation methods to evade the GFW, as Tor does with its pluggable transport meek. Besides, MultiProxy will need a user-friendly and configurable GUI based on the existing command-line interface.


Bibliography

[1] Eytan Adar and Bernardo A Huberman. Free riding on gnutella. First monday, 5(10), 2000.

[2] Sergei Arnautov, Bohdan Trach, Franz Gregor, Thomas Knauth, Andre Martin, Christian Priebe, Joshua Lind, Divya Muthukumaran, Dan O’Keeffe, Mark L. Stillwell, David Goltzsche, Dave Eyers, Rudiger Kapitza, Peter Pietzuch, and Christof Fetzer. SCONE: Secure linux containers with intel SGX. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 689–703, Savannah, GA, 2016. USENIX Association.

[3] Chadi Barakat, Patrick Thiran, Gianluca Iannaccone, Christophe Diot, and Philippe Owezarski. Modeling Internet backbone traffic at the flow level. IEEE Trans. Signal Process., 51(8):2111–2124, 2003.

[4] Cecylia Bocovich and Ian Goldberg. Slitheen: Perfectly imitated decoy routing through traffic replacement. CCS, 2016.

[5] Cecylia Bocovich and Ian Goldberg. The road not taken: Secure asymmetry and deployability for decoy routing systems. 2016.

[6] Sam Burnett, Nick Feamster, and Santosh Vempala. Chipping Away at Censorship Firewalls with User-Generated Content. Computer (Long. Beach. Calif)., pages 29–29, 2010.

[7] Bram Cohen. Incentives build robustness in bittorrent. In Workshop on Economics of Peer-to-Peer systems, volume 6, pages 68–72, 2003.

[8] Ziye Deng, Zihan Liu, Zhouguo Chen, and Yubin Guo. The random forest based detection of shadowsock’s traffic. In Intelligent Human-Machine Systems and Cybernetics (IHMSC), 2017 9th International Conference on, volume 2, pages 75–78. IEEE, 2017.

[9] Diogo Barradas, Nuno Santos, and Luís Rodrigues. DeltaShaper: Enabling Unobservable Censorship-resistant TCP Tunneling over Videoconferencing Streams. Proc. Priv. Enhancing Technol., 2017(3):1–3, 2017.

[10] Frederick Douglas, Rorshach, Weiyang Pan, and Matthew Caesar. Salmon: Robust Proxy Distribution for Censorship Circumvention. Proc. Priv. Enhancing Technol., 2016(4):4–20, 2016.

[11] Roya Ensafi, David Fifield, Philipp Winter, Nick Feamster, Nicholas Weaver, and Vern Paxson. Examining how the great firewall discovers hidden circumvention servers. In Proceedings of the 2015 Internet Measurement Conference, pages 445–458. ACM, 2015.

[12] David Fifield, Chang Lan, Rod Hynes, Percy Wegmann, and Vern Paxson. Blocking-resistant communication through domain fronting. Proc. Priv. Enhancing Technol.,2015(2):46–64, 2015.

[13] David Fifield, Lynn Tsai, and Qi Zhong. Detecting Censor Detection. 2017.[14] Sergey Frolov, Fred Douglas, Will Scott, Allison Mcdonald, Benjamin Vandersloot,

Rod Hynes, Adam Kruger, Michalis Kallitsis, David G Robinson, Steve Schultze,Nikita Borisov, J Alex Halderman, and Eric Wustrow. An ISP-Scale Deployment ofTapDance. pages 1–7.

[15] T. He, H. Zhang, X. Li, and Z. Li. A Methodology for Analyzing Backbone NetworkTraffic at Stream-Level. Int. Conf. Commun. Technol. Proceedings, ICCT, 1, 2003.

[16] Amir Houmansadr, Chad Brubaker, and Vitaly Shmatikov. The parrot is dead: Ob-serving unobservable network communications. Proc. - IEEE Symp. Secur. Priv.,pages 65–79, 2013.

[17] A Houmansadr, T J Riedl, N Borisov, and A C Singer. I want my voice to be heard:IP over Voice-over-IP for unobservable censorship circumvention. Ndss, 2013.

[18] Amir Houmansadr, Wenxuan Zhou, Matthew Caesar, and Nikita Borisov. SWEET:Serving the web by exploiting email tunnels. IEEE/ACM Trans. Netw., 25(3):1517–1527, 2017.

[19] John M. (Intel). Workflow of intel sgx remote attestation, 04.07.2018.https://software.intel.com/en-us/articles/code-sample-intel-software-guard-extensions-remote-attestation-end-to-end-example.

[20] Ed. J. Pouwelse. Trustchain protocol, 2018. https://tools.ietf.org/id/draft-pouwelse-trustchain-01.html.

[21] Prajakta Joshi. Premium networking topology of google cloud instances, 2017.https://cloud.google.com/blog/products/gcp/introducing-network-service-tiers-your-cloud-network-your-way.

[22] Thomas Karagiannis, Andre Broido, Michalis Faloutsos, and Kc Claffy. Transportlayer identification of P2P traffic. IMC ’04 Proc. 4th ACM SIGCOMM Conf. InternetMeas., pages 121–134, 2004.

[23] Thomas Karagiannis, Konstantina Papagiannaki, Michalis Faloutsos, Thomas Kara-giannis, Konstantina Papagiannaki, and Michalis Faloutsos. Blinc. Proc. 2005 Conf.Appl. Technol. Archit. Protoc. Comput. Commun. - SIGCOMM ’05, 35(4):229, 2005.

[24] Josh Karlin, Daniel Ellard, Alden W. Jackson, Christine E. Jones, Greg Lauer,David P. Mankins, and W. Timothy Strayer. Decoy Routing: Toward UnblockableInternet Communication. USENIX Work. Free Open Commun. Internet, 2011.

[25] Yu Ju Lee and Eric Wustrow. OverTorrent: Anticensorship without centralized serv-ers. 2016 14th Annu. Conf. Privacy, Secur. Trust. PST 2016, pages 388–391, 2016.

[26] Marcus Leech, Matt Ganis, Y Lee, Ron Kuris, David Koblas, and L Jones. Socksprotocol version 5. Technical report, 1996.

[27] Gang Liu, Xiaochun Yun, Binxing Fang, and Mingzeng Hu. A control method forlarge-scale network based on routing diffusion. 2003.

[28] Hui Liu, Xi Yao, and Xinpeng Li. A method and device for tcp connection reset.2003.

[29] H Mohajeri Moghaddam and B Li. SkypeMorph: Protocol obfuscation for Tor bridges. Proc. . . . , pages 97–108, 2012.

[30] Andrew W Moore and Konstantina Papagiannaki. Toward the accurate identification of network applications. Passiv. Act. Netw. Meas., 3431:41–54, 2005.

[31] Johan Pouwelse. Trustchain protocol. Internet-Draft draft-pouwelse-trustchain-01, Internet Engineering Task Force, June 2018. Work in Progress.

[32] The Tor Project. The average connected clients of cymrubridge02, 2019. Accessed: 2019-03-01. https://metrics.torproject.org/rs.html#details/8F4541EEE3F2306B7B9FEF1795EC302F6B84DAE8.

[33] Michael G Reed, Paul F Syverson, and David M Goldschlag. Anonymous connections and onion routing. IEEE Journal on Selected Areas in Communications, 16(4):482–494, 1998.

[34] Haiying Shen, Alex X. Liu, Guoxin Liu, and Lianyu Zhao. Freeweb: P2P-assisted collaborative censorship-resistant web browsing. IEEE Trans. Parallel Distrib. Syst., 27(11):3226–3241, 2016.

[35] Yair Sovran, Jinyang Li, and Lakshminarayanan Subramanian. Unblocking the Internet: Social networks foil censors. NYU Technical Report TR2008-918, pages 1–19.

[36] Qingfeng Tan, Jingqiao Shi, and Bingxing Fang. Towards measuring unobservability in anonymous communication systems. 2015.

[37] Tribler. Creating a 2-hop circuit in ipv8, 2018. https://github.com/Tribler/tribler/wiki/Anonymous-Downloading-and-Streaming-specifications.

[38] Qiyan Wang, Giang T K Nguyen, and Nikita Borisov. CensorSpoofer: Asymmetric Communication using IP Spoofing for Censorship-Resistant Web Browsing. Proc. 2012 ACM Conf. Comput. Commun. Secur., pages 121–132, 2012.

[39] Zhongjie Wang, Yue Cao, Zhiyun Qian, Chengyu Song, and Srikanth V. Krishnamurthy. Your state is not mine. Proc. 2017 Internet Meas. Conf. - IMC ’17, pages 114–127, 2017.

[40] Zachary Weinberg, Jeffrey Wang, Vinod Yegneswaran, Linda Briesemeister, Steven Cheung, Frank Wang, and Dan Boneh. StegoTorus: A Camouflage Proxy for the Tor Anonymity System. Proc. 2012 ACM Conf. Comput. Commun. Secur., pages 109–120, 2012.

[41] Wikipedia. Great firewall — Wikipedia, the free encyclopedia. [Online; accessed 14-March-2019].

[42] Brandon Wiley. Dust: A Blocking-Resistant Internet Transport Protocol. Defcon, 2013.

[43] Philipp Winter, Tobias Pulls, and Juergen Fuss. ScrambleSuit. Proc. 12th ACM Work. Priv. Electron. Soc. - WPES ’13, pages 213–224, 2013.

[44] Eric Wustrow. Telex: Anticensorship in the Network Infrastructure. Design, 10(August):1–15, 2011.

[45] Xueyang Xu, Z. Morley Mao, and J. Alex Halderman. Internet censorship in China: Where does the filtering occur? Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), 6579 LNCS:133–142, 2011.

[46] Ruixi Yuan, Zhu Li, Xiaohong Guan, and Li Xu. An SVM-based machine learning method for accurate internet traffic classification. Information Systems Frontiers, 12(2):149–156, 2010.

[47] Tao Zhu, David Phipps, Adam Pridgen, Jedidiah R. Crandall, and Dan S. Wallach. The Velocity of Censorship: High-Fidelity Detection of Microblog Post Deletions. 2013.

Acronyms

ACL Access Control List. 12

AS Autonomous system. 13

BGP Border Gateway Protocol. 12

CAS Configuration and Attestation Service. 51

CRS Censorship resistance system. 8

DAS5 The Distributed ASCI Supercomputer 5. 59

DNS Domain Name System. 13

DPI Deep packet inspection. 17

DTLS Datagram Transport Layer Security. 60

FQDN Fully Qualified Domain Name. 24

GFW Great Firewall of China. v, 1, 5, 12

HTTP Hypertext Transfer Protocol. 13

HTTPS Hypertext Transfer Protocol Secure. 13

ICMP Internet Control Message Protocol. 24

IP Internet Protocol. 16

ISP Internet Service Provider. 16

MAC Media access control. 31

MITM Man-in-the-middle attack. 45

MX Mail exchanger. 33

NS Name Server. 33

OSPF Open Shortest Path First. 13

P2P Peer-to-Peer. 1, 14

RTT Round-trip time. 58

SGX Intel Software Guard Extensions. 48

SNI Server Name Indication. 6, 20

SOA Start of Authority. 33

SSH Secure Shell. 13

TCP Transmission Control Protocol. 17

TLS Transport Layer Security. 20

TTL Time to Live. 27, 30

UDP User Datagram Protocol. 17

VPN Virtual Private Network. 36, 60

VPS Virtual Private Server. 29
