
Masaryk University
Faculty of Informatics

System for detection of websites with phishing and other malicious content

Bachelor's Thesis

Tomáš Ševčovič

Brno, Fall 2017

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during the elaboration of this work are properly cited and listed in complete reference to the due source.

Tomáš Ševčovič

Advisor: prof. RNDr. Václav Matyáš, M.Sc., Ph.D.


Acknowledgement

I would like to thank prof. RNDr. Václav Matyáš, M.Sc., Ph.D. for supervising this bachelor thesis and for his valuable advice and comments. I would also like to thank the consultant from CYAN Research & Development s.r.o., Ing. Dominik Malčík, for useful advice, dedicated time and patience during consultations and application development.

Also, I would like to thank my family and friends for their support throughout my studies and work on this thesis.


Abstract

The main goal of this bachelor thesis is to create a system for the detection of websites with phishing and other malicious content, with respect to Javascript interpretation. The program should be able to download and process thousands of domains and obtain positive results.

The first step involves examining an overview of automated web testing tools to find an optimal tool to be used in the main implementation. The thesis contains an overview of technologies for website testing, their comparison, an overview of malware methods on websites, and the implementation and evaluation of the system.


Keywords

Chrome, Javascript, link manipulation, malware, phishing, URL redirects, XSS, Yara


Contents

1 Introduction

2 Overview of approaches to website testing
   2.1 Manual testing
   2.2 Automated testing
       2.2.1 Selenium
       2.2.2 Other website testing tools

3 Comparison of tools for automated website testing
   3.1 Criteria
   3.2 Compared tools
   3.3 Evaluation
       3.3.1 Conclusion of comparison

4 Detection of phishing and other malicious content
   4.1 Malicious content on websites
       4.1.1 Phishing
       4.1.2 Other malicious content
   4.2 Detection of phishing
       4.2.1 Cross-site Scripting
       4.2.2 URL Redirects
       4.2.3 Link Manipulation
       4.2.4 Imitating trusted entity
       4.2.5 Detection of other malicious content

5 Implementation
   5.1 Design
   5.2 Tools and libraries
       5.2.1 Google Chrome
       5.2.2 Wget
       5.2.3 Beautiful Soup
       5.2.4 PyLibs
       5.2.5 Yara patterns
   5.3 Input
   5.4 Output

6 Evaluation of results
   6.1 Optimization
       6.1.1 Parallelism
       6.1.2 Database
   6.2 Execution times
       6.2.1 Conclusion
   6.3 Results
       6.3.1 Comparison with Google Safe Browsing
       6.3.2 Effect of Javascript interpreter
   6.4 Further work

7 Conclusion

Bibliography

List of Figures

2.1 Selenium IDE plug-in for Mozilla Firefox.
3.1 Worldwide share of the usage of layout engines in November 2017. Data collected from [10].
3.2 Worldwide share of the usage of Javascript engines in November 2017. Data collected from [10].
3.3 Example of a script for downloading a website in PhantomJS.
3.4 Overview of the main characteristics of headless browsers.
3.5 Result of the /usr/bin/time -v command.
4.1 How phishing works [18].
5.1 A diagram of the program.
5.2 Output of one website.
6.1 Usage of RAM by Chrome.
6.2 The performance of the testing PC.
6.3 Average execution times per page.
6.4 Average execution times by percentage.
6.5 Ratio of exposed malware per million domains.
6.6 Average execution times by percentage.
6.7 Results of detection in one million domains.
6.8 Comparison of results of Chrome and Wget.

1 Introduction

Every day, malicious content on the Internet attacks numerous users in every corner of the world. Deceptive techniques designed to obtain sensitive information from the user, by acting like a trustworthy entity, often appear within the web content. These techniques are known as phishing. Besides phishing, there are further threats on the Internet which can be injected via Javascript. All it often takes is downloading an unverified file that can contain a computer virus.

The aim of this bachelor thesis is to explore the area of available test tools and technologies for the detection of such websites and applications. These instruments must be able to interpret the Javascript code of a website, acquire all its content and then work with it. Another aim is to compare the instruments to each other and to make an informed choice of the best one for the practical part of this thesis. A further objective of this thesis is to create a system that will be able to detect phishing and other dangerous content that appears on websites, using the selected tool. The created system has to work efficiently and has to be implemented and run on a Linux server.

I prepared an overview of the available options for testing and retrieving the content of websites. For viewing, handling or automated testing of web content, a basic rendering layout engine is always required. For interpreting Javascript, a Javascript engine is needed within the tool. Among the most common options that can process and test web content are headless browsers, extensions for various web browsers, and tools or libraries utilizing the environment of a browser as a means of obtaining content (e.g. Selenium). The only option for interpreting Javascript within websites while running in the background of a server are headless browsers.

For the creation and implementation of a detection tool for malicious content, an extensive study of the techniques which attackers use to deceive users is necessary. Then, patterns need to be found which can detect certain malware or which determine the occurrence of the searched malware. Phishing has many methods, like cross-site scripting for the injection of dangerous code into a website, or URL redirects which move a user to an unwanted (mostly phishing) website. There are also viruses and trojan horses on websites which


can be detected by checking whether the Javascript code contains the malware patterns.

The implementation is designed for use in the background of a Linux system, with the ability to process thousands of domains. There are many ways to detect malicious content on a website. In this implementation, detection from the user's point of view, as when they come across an infected website, was chosen. This means that the program performs detection based on information from the DOM and on the domain name.

Chapter 2 is an overview of methods for website testing. Chapter 3 compares tools for automated website testing and concludes with the choice of the optimal tool for the main implementation. Chapter 4 includes a summary of website malware methods and their detection. Chapter 5 contains the design of the main implementation and the tools which were used in it. Chapter 6 describes the evaluation of random inputs of the program, comments on the results of the individual types of detection, and discusses how to fix weaknesses and slow parts of the program. The final chapter concludes this thesis.


2 Overview of approaches to website testing

2.1 Manual testing

One of the first types of testing at hand is manual testing, which has four major stages for minimizing the number of defects in the application: unit testing, integration testing, system testing and user acceptance testing. The tester must impersonate the end user and use all the features in the application in order to ensure error-free behavior.

Information in this chapter is gathered from [1].

Unit testing: Manual unit testing is not much used nowadays. It is an expensive method, because the test is done manually. Testers have a test plan and must go through all the prepared steps and cases. It is very time-consuming to perform all the tests. This disadvantage is solved through automated unit testing.

Integration testing: Tests are not prepared by a developer but by a test team. Flawless communication between the individual components inside the application must be verified. Integration can also be verified between the components and the operating system, hardware or system interface.

System testing: After the completion of unit and integration testing, the program is verified as a whole complex. It verifies the application from the customer's perspective. Various steps that might occur in practice are simulated based on prepared scenarios. They usually take place in several rounds. Found bugs are fixed and, in the following rounds, these fixes are tested again.

User acceptance testing: If all the previous stages of the tests are completed without major shortcomings, the application can be given to the customer. The customer then usually performs acceptance tests with their team of testers. Found discrepancies between the application and specifications are reported back to the development team. Fixed bugs are deployed to the customer's environment.


2.2 Automated testing

Automated testing is a process of automating manual tests using automated instruments such as Selenium.

Automated testing has several advantages over manual testing. It prevents errors where a part of the test is left out. In automated testing, the same code is always performed, so there is no room for human error, such as a bad entry into the input field. Although it is necessary to spend time creating the automated tests at the beginning, a lot of time is ultimately saved, because everything is tested automatically; tests can be run in parallel on multiple platforms and faster than if the individual acts were carried out by people. In addition, tests can be run without greater effort after any major change in the tested application.
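The parallel execution mentioned above can be illustrated with a minimal sketch. The page string and the two check functions below are hypothetical stand-ins for real browser-driven tests; only the pattern of submitting independent test cases concurrently is the point.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical test cases: each checks one property of a downloaded page
# and returns True on success. Real tests would drive a browser instead.
def check_title(page):
    return "<title>" in page

def check_login_form(page):
    return "<form" in page

def run_tests_in_parallel(page, tests):
    """Run independent test cases concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(test, page) for test in tests]
        return [f.result() for f in futures]

page = "<html><title>Example</title><form action='/login'></form></html>"
results = run_tests_in_parallel(page, [check_title, check_login_form])
print(results)  # → [True, True]
```

Because the same code runs on every execution, the results are reproducible in a way manual entry is not.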

The website document is tested through the DOM in combination with the XPath language, which is used to address individual elements of the website.

Information in this paragraph is drawn from [2].

DOM: Document Object Model (DOM) is a platform- and language-independent interface and a W3C international consortium standard. With DOM, programs and scripts can access the contents of HTML and XML documents, or change the content. DOM allows you to access the document tree and add or remove individual nodes. The root node of the tree is always the Document node, representing the document as a whole. The specifications of this standard are divided into several levels, where a newer level extends the original levels and preserves backward compatibility.

Information in this paragraph is drawn from [3].
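As an illustration of the DOM interface described above, the following sketch uses Python's standard xml.dom.minidom module; the document fragment is invented for the example. It shows the Document root node and the addition and removal of nodes in the tree.

```python
from xml.dom.minidom import parseString

# A minimal XHTML-like fragment; minidom exposes a W3C DOM interface.
doc = parseString("<html><body><p>Hello</p></body></html>")

# The root of the tree is always the Document node.
assert doc.nodeType == doc.DOCUMENT_NODE

# Access content through the tree.
body = doc.getElementsByTagName("body")[0]
print(body.firstChild.firstChild.data)  # → Hello

# Add a node ...
new_p = doc.createElement("p")
new_p.appendChild(doc.createTextNode("World"))
body.appendChild(new_p)

# ... and remove one.
body.removeChild(body.firstChild)
print(len(doc.getElementsByTagName("p")))  # → 1
```

The same operations are available, with the same names, from Javascript running in a browser, which is what makes the DOM language-independent.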

XPath: XPath is a language for addressing nodes in XML documents. The result of an XPath query is a set of elements, or the value of an attribute of a given element. XPath provides many ways to address a specific element. Both relative and absolute addresses (assigned by individual elements from the root to the given element) can be used. Furthermore, it contains many predefined functions for addressing the offspring


and parents of the current node. It also contains commands for finding an element with a specific attribute value and many other features.

Information in this paragraph is drawn from [4].
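The addressing concepts above can be sketched with Python's standard xml.etree.ElementTree module, which implements a limited subset of XPath; the catalog document below is invented for the example.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<catalog>"
    "  <book id='b1'><title>DOM Basics</title></book>"
    "  <book id='b2'><title>XPath Basics</title></book>"
    "</catalog>"
)

# Absolute-style path from the root to the given elements.
titles = [t.text for t in root.findall("./book/title")]
print(titles)  # → ['DOM Basics', 'XPath Basics']

# Relative addressing: find any descendant with a specific attribute value.
book = root.find(".//book[@id='b2']")
print(book.find("title").text)  # → XPath Basics
```

A full XPath 1.0 engine additionally offers axes for parents and ancestors and many predefined functions, which this stdlib subset omits.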

2.2.1 Selenium

Selenium is a suite of open-source tools for automated testing of web applications. It can be used with different browsers (Firefox, Chrome, Safari, Edge), including headless browsers like PhantomJS or HtmlUnit. Selenium is cross-platform and can be controlled by many programming languages and test frameworks. Selenium attempts to interact with the browser as a real user.

This set of tools includes four powerful components that will be described in the following paragraphs.

Selenium IDE: Selenium IDE is used for creating tests and automated tasks using an add-on for Firefox (see Figure 2.1). Tests are made by a user recording his or her activity, which is then written in the form of individual commands into a table. The table can also be manually edited with individual commands. These can be basic commands, such as clicking or inserting text, but also advanced methods of verifying the title of the website, or various assertions similar to those of the JUnit test library for the Java language. Individual commands contain three items: the first item is the command name, the second one specifies the targeted element of the command, and the last one contains the command value.

Created tests (commands) can also be displayed in the HTML code, and it is also possible to export them into one of the programming languages Java, Ruby or C# and then use them in the Selenium WebDriver described below.

Selenium Remote Control: Also known as Selenium 1, it consists of two main parts.

The first part is the Selenium Remote Control Server, which starts/ends the web browser and interprets and runs Selenium commands from the test program. These commands are interpreted using the JavaScript interpreter of the given web browser. After executing a command, the server returns the result of the command to the testing program.


Figure 2.1: Selenium IDE plug-in for Mozilla Firefox.

Furthermore, it works as an HTTP proxy, capturing and authenticating HTTP messages between the web browser and the testing program. The server accepts Selenium commands via simple HTTP GET/POST requests, so a wide variety of programming languages can be used, as long as they can send HTTP requests to the server and receive the responses back in the testing program.

The second part are the client libraries that provide a software interface between the programming language and the Selenium RC Server. The supported programming languages are Java, Ruby, Python, Perl, PHP and .NET, and Selenium offers a client library for each of them. These client libraries take Selenium commands and transmit them to the Selenium server to be performed, which then returns the return value of the command.

WebDriver: The main change in Selenium 2 is the integration of the WebDriver API. It was created to better support dynamic websites on which the elements can be changed without the need for the whole website to be reloaded.


The WebDriver API works such that when an instance of a class which implements the WebDriver interface is created, the browser opens a new window and communicates with the instance. This instance can be created using the ChromeDriver, FirefoxDriver and other classes.

Selenium 2 makes direct calls to the web browser using its native support for automation. How the direct calls are implemented depends on the web browser on which they are performed, unlike Selenium RC, which is forced to inject Javascript code into the web browser and induce the appropriate actions that way.

With an increasing number of tests and their increasing complexity, different limits begin to show. Selenium 2 can be very slow when controlling the browser, and the Remote Control Server may become a bottleneck of testing. Parallel execution of multiple concurrent tests on the same Remote Control Server manifests itself through a reduction of the stability of the tests (it is not recommended to run more than six concurrent tests; this number is even smaller in Internet Explorer). Because of these limitations, Selenium tests are generally run consecutively in sequence or only slightly in parallel. These problems are solved by Selenium Grid.

Selenium Grid: Selenium Grid takes advantage of the fact that the tested web application and the Remote Control Server/web browser do not have to be running on one computer, but on multiple computers communicating with each other using HTTP.

A typical use of Selenium Grid is to have a certain set of Selenium Remote Control Servers that can be easily shared across different versions, test applications and projects, where each Selenium Remote Control may offer various platforms and versions, or types of web browsers. Each Selenium Remote Control informs the Selenium Hub about which platforms and browsers it provides.

The Selenium Hub allocates a Selenium Remote Control for specific test requirements. A test can request a specific platform or version of the web browser. Furthermore, the Hub limits the number of concurrent tests and shields the Selenium Grid architecture from the tests. Shielding the Selenium Grid architecture is advantageous because a change from Selenium WebDriver to Selenium Grid in the test program code requires almost no change.


2.2.2 Other website testing tools

Watir: The Watir project (Web Application Testing in Ruby) is a set of open-source (BSD) Ruby libraries for automating the web browser. Watir interacts with the web browser just like a real user: it is able to click on links and buttons, fill out forms, and it lets you check whether the expected text is displayed on the website.

The project is multi-platform and supports Chrome, Internet Explorer, Firefox, Opera and Safari. Browsers are controlled in a different way than by HTTP test libraries (e.g. Selenium). Watir controls the browser directly using the Object Linking and Embedding protocol, which is built on the Component Object Model (COM). The process of the web browser serves as an object, making its methods accessible for automation. These methods are then called from user programs in the Ruby language.

Information in this paragraph is drawn from [5].

Huxley: A test system which, unlike the other systems mentioned in this document, tests content based on visual appearance (by comparing screenshots). Ordinary automated tests cannot visually determine that something is not right; this problem is solved by the Huxley system. It is written in the Python scripting language and can work in two modes.

The first mode is Playback, where the individual tests created in Record mode are run. It works by comparing newly created screenshots with those that were taken earlier. If the images differ, a notification appears and the designer checks whether the error was indeed in the user interface.

The second mode is Record, where, using the Selenium WebDriver, a new browser window is opened that records user actions. If the user wants to compare screenshots at a given point, he or she presses Enter.

Information in this paragraph is drawn from [6].

Splash: A Javascript rendering service that is used for testing the behavior of a website or for taking screenshots of the website. Splash is written in Python, uses an HTTP user interface, allows you to process multiple websites in parallel and to use Adblock Plus filters for faster rendering of websites. Its test scripts are written in the Lua language.

Information in this paragraph is drawn from [7].


SauceLabs: A service that works on the principle of Selenium Grid and simplifies application testing for developers on different operating systems or web browsers. SauceLabs allows connection with the Selenium test library.

There is a possibility of manual testing, where the user selects a platform and browser for testing. After the selection, a window opens with a remote desktop of the selected platform with a running browser.

Furthermore, SauceLabs offers testing of websites on Android and iOS using the open-source automation tool named Appium. This tool can test both native and hybrid applications; it can even test the mobile version of Safari on iOS.

The cheapest version of this service costs $19 per month for a single virtual computer. The most expensive version costs $52 per month for eight virtual computers and 4,000 minutes of automated testing. These prices are as of 1 August 2017.

Information in this paragraph is drawn from [8].

BrowserStack: This service is similar to SauceLabs; it offers manual testing called Live, automated testing using Selenium, and a service generating previews (screenshots) of websites.

Compared to SauceLabs, BrowserStack offers many types of licenses. Automatic testing with two virtual machines costs $99 per month and with five machines it costs $199 per month. These prices are as of 1 August 2017.

Information in this paragraph is drawn from [9].

Headless browser: A type of browser that connects to a specific website, but without the GUI. This means that the user cannot see any rendered result of the visited website.

These browsers support Javascript and the DOM and contain a rendering layout core like WebKit or Gecko. They do not support audio, video, plugins, Flash, Java, WebGL, geolocation, CSS 3-D, ambient light or speech. Most of them can include and execute a user's script and use DOM events like clicking or typing to imitate a user.

Headless browsers are used for providing content for programs or scripts, automated testing, analyzing performance, monitoring the network or taking a screenshot of the entire website content. They are also used to perform DDoS attacks or to increase advertisement impressions.

Headless browsers are described in more detail in Section 3.1.


3 Comparison of tools for automated website testing

In this chapter, the selected tools and their comparison are briefly described, with the goal of selecting the best tool for the main implementation in this bachelor thesis. Six different headless browsers were chosen according to different combinations of their cores and programming languages. They are PhantomJS, SlimerJS, HtmlUnit, ZombieJS, Ghost.py and Google Chrome (in headless mode).

Headless browsers are the best choice for this implementation, as all of them run without a GUI and only in the background of the system. This also decreases the load on operating memory and the CPU workload. Additionally, they contain a Javascript interpreter, which means that they can return the contents of the website to the output only after executing the Javascript code that the website contains.

3.1 Criteria

For a suitable tool, a solution is needed that can run in the background of a Linux operating system. Therefore, tools that work only with other operating systems have been excluded. Furthermore, the tool must be able to interpret Javascript code. Not all headless browsers guarantee a flawless and full implementation of Javascript, which is extremely important for the implementation to make sure some kinds of suspicious behavior do not pass unnoticed. These criteria are given by the specifications of the bachelor thesis.

Moreover, the tool should have comprehensible documentation for a comfortable study of the options of its API. It would also be good for the tool to have a user forum with developers or other users who would be able to deal with potential problems that may arise during implementation. After discovering a malfunction of any of the tool's components, there should be an opportunity to report errors so that they can be fixed in future versions. In that case, it would be appropriate to choose a tool that is constantly being developed, possibly one with regular updates or at least one that fixes occurring bugs. These criteria are not present in the assignment of the bachelor thesis; however, for an efficient and practical use of the tool, they need to be included, as not all tools necessarily fulfill them and they may be key to the illustrative function of the thesis implementation.

The tool should be able to process the largest number of domains specified on the input in the shortest possible time (e.g. 100,000 domains) with the lowest RAM workload possible. These parameters are shown in Section 3.3, where a comparison of the speed and memory utilization of each tool is presented. The selected tool should contain a commonly used layout rendering core and a Javascript rendering core. These cores are also used by regular web browsers and, because of that use, they are a much better match than cores created just for a specific tool. These criteria are also not listed in the specifications of the bachelor thesis, but they are important for selecting a quality, universal tool.

Layout rendering core: Also known as a web browser engine, it is a program for rendering the website. It renders marked-up content like HTML, XML or images, and formatting information such as CSS styles. The layout rendering core does not have to be used only in web browsers, but also in every system or device that somehow accesses the content of the website.

The layout core with the highest utilization is WebKit, which is used by Safari from Apple, and by Google Chrome, which uses a newer fork called Blink. Second place is held by Gecko, which is mainly used in Mozilla Firefox. Internet Explorer uses the Trident core, Microsoft Edge uses EdgeHTML, and Opera used Presto before it, too, switched to Blink.

Javascript rendering engine: This engine is a program that is responsible for executing Javascript code and is mainly used in web browsers.

The engine with the largest worldwide share is V8, which is developed by Google and mainly used in Google Chrome and Opera. Firefox has the SpiderMonkey engine (early versions used RhinoJS). Internet Explorer and Microsoft Edge have the same Javascript engine, called Chakra. Safari uses JavaScriptCore, which is part of the present WebKit.


Figure 3.1: Worldwide share of the usage of layout engines in November 2017. Data collected from [10].

3.2 Compared tools

PhantomJS: A cross-platform headless browser written in C++ that uses the WebKit rendering layout engine with a scriptable Javascript API. Its Javascript rendering engine is JavaScriptCore (like Safari's). PhantomJS supports working with the DOM, HTML5, CSS3 selectors, JSON, SVG and Canvas. Its API allows the user to open a website, modify its content, click on links, perform automated website testing with jQuery support, set or change cookies, capture screenshots, listen to network requests and subsequently convert them into the HAR format, simulate the keyboard and mouse, and read/write files. PhantomJS itself is not a testing framework; it must include a test runner like CasperJS, Jasmine, QUnit, etc. It may also be used as a browser (instance) for the Selenium WebDriver.

PhantomJS is the most used headless browser, mainly for scraping the content of websites or testing them. The documentation is not complete, but the website of the tool also contains many useful examples of its different uses and the utilization of its options.

Information in this subsection is drawn from [11].

Figure 3.2: Worldwide share of the usage of Javascript engines in November 2017. Data collected from [10].

Figure 3.3: Example of a script for downloading a website in PhantomJS.

SlimerJS: A tool similar to PhantomJS, but with other rendering engines. The layout rendering core is Gecko (from Firefox) and the Javascript rendering engine is SpiderMonkey. It uses an API very similar to PhantomJS, but some features are missing. The developers of this tool are trying to make the API exactly the same as in PhantomJS, but this is still in progress. SlimerJS is not really a headless browser: after it launches, a Firefox browser window is also opened until the script is completed (the dependence on Firefox should be removed in the next version of the tool). It runs on the platforms on which Firefox can run. It also supports features such as audio, video, etc. SlimerJS can become headless only by using the xvfb tool (a virtual display server). The tool has a dependency on Firefox and will not launch without its installation.

Information in this subsection is drawn from [12].

HtmlUnit: A headless browser written in Java and intended for use from Java. It supports two different rendering layout engines that you can choose from (WebKit and Gecko). The Javascript rendering engine is RhinoJS (developed by Mozilla), which is used very little nowadays. HtmlUnit supports many Javascript libraries for unit tests, such as jQuery, MochiKit, Dojo, etc. It can work with AJAX (Asynchronous JavaScript and XML) but, compared to PhantomJS, does not yet include full support of Javascript.

Information in this subsection is drawn from [13].

ZombieJS: This headless full-stack testing tool [14] built on Node.js allows the user to interact with the website, but the tool itself does not allow the user to make assertions. This requires a test framework that runs on Node.js, such as Mocha or Jasmine. The testing framework allows the user to run tests serially and to do flexible reporting of tests, which makes testing easier. To install ZombieJS, the Node.js libraries and io.js, which are necessary for the functionality of the tool, are required.

Ghost.py: A Python library [15] for testing and scraping web content. Ghost.py uses the WebKit web client for accessing website content and the JavaScriptCore Javascript interpreter. For proper operation, an installation of PySide or PyQt is required.

Google Chrome: A freeware web browser developed by Google which also includes a headless environment from version 59 on Mac and Linux (60 on Windows) [16]. Its web engine is Blink, which is a fork of WebKit, and its Javascript interpreter is V8. The headless environment in Linux is executed by the following command:

google-chrome-stable --headless --disable-gpu http://example.com

Listing 3.1: Command for the execution of headless Chrome.

The command supports three main features: creating a PDF (--print-to-pdf), printing the DOM (--dump-dom) and taking a screenshot of the website at a given resolution (--screenshot --window-size).

3.3 Evaluation

Written in: The chosen tool should process even thousands of websites in a short time. Among the languages in which the compared instruments are written, the best option is C++, due to its speed compared to the languages of the other selected instruments.

Supported language: The best choice of supported language is Javascript, because in the main implementation the best solution is to work with the Javascript code of the website to find the patterns of phishing cases, which are mostly written in the code. It is also possible to use many Javascript libraries, such as jQuery, with many useful functionalities and for better clarity of the code.

Layout rendering core: It determines that the tool sees the website in the same way as the user does. To see exactly the same output as the majority of worldwide users, the best solution is to choose the most used layout engine, Blink.

Javascript rendering core: The most suitable Javascript engine forBlink is V8 as they are developed by the same company and they worktogether well.

Community/forum: Only HtmlUnit, PhantomJS, SlimerJS and Chromecan be listed among active communities which have their forum whereusers can help each other.

Execution time: A simple script for each tool was written that down-loads the full text of the website after performing a simple Javascriptcode. The tested website is a locally stored basic HTML website withvarious basic elements. The Javascript code that appears on this web-site is again a simple script that changes the content of the elementson the website. The results in the table present the average speed often attempts of loading the website. The result is very relative but itgives us a basic overview of values in which the results range. Thetesting was conducted locally in order not to degrade the results inregard to the Internet connection.

The Linux command /usr/bin/time with the parameter -v was usedfor the execution time of each script in speciĄc tools which also showsthe maximum RAM usage during the performance and other infor-mation about the memory.


Figure 3.4: Overview of the main characteristics of headless browsers.

Testing was conducted on a Lenovo ThinkPad E520 NZ35WMC laptop with the 64-bit operating system Fedora 23.

DOM comparison: The DOM printed after Javascript execution on random websites by each tool was compared. Surprisingly, not every tool returned the same result. PhantomJS, SlimerJS, HtmlUnit and Ghost.py returned an empty page for http://youtube.com, and ZombieJS also for http://seznam.cz. Chrome's headless mode did not have any problem with any of the tested pages and returned the same result as the web browser.

3.3.1 Conclusion of comparison

In this chapter, six different tools were chosen to be compared with each other. The main criteria for the selection of the best tool were the support of Linux, running in the background, an active community, technical requirements including the highest processing speed, and the most widely used rendering engines.

The analysis shows that the Blink layout rendering core (a fork of WebKit) has the largest share of use in web browsers (including mobile devices) in the world; therefore, using it yields exactly the same rendering of a website as the majority of Internet users see.

The section assessment describes the ideal tool parameters. Most of them are met by only one tool, PhantomJS. However, PhantomJS cannot return the DOM of all the tested websites.

Figure 3.5: Result of the /usr/bin/time -v command.

The only tool with correct results is Google Chrome's headless environment; therefore, this tool was selected for the main implementation. It is written in C++, it fully supports Javascript, and it has the Blink layout core and the V8 Javascript engine. Additionally, it has the largest community and the widest use in the area of headless browsers. In the speed testing of all the tools, it finished as the fastest, with average operating-memory usage compared to the others.


4 Detection of phishing and other malicious content

This chapter gives a definition of the malicious content that is found on websites. Furthermore, it contains a description of methods for detecting some types of these attacks and malicious content.

My thesis specializes in detecting malicious content based on the content of the tested website and on processing the URL given as input.

4.1 Malicious content on websites

4.1.1 Phishing

Phishing is a way of obtaining sensitive data (passwords or credit card details) from the user for subsequent abuse by acting like a trustworthy entity. Most often it takes advantage of the gullibility of the user in a way that the user will not notice at first glance; in the worst case, the attacker obtains the user's data without their knowledge.

One of the oldest phishing tactics is sending fraudulent e-mails. These emails contain links to malicious websites that look, for example, like the website of the user's bank. The website can contain a form for obtaining the user's information or a link to download a malicious file.

There are four basic ways of spreading phishing content to users. The first one is the email-email way, where communication happens only by email. The second one is email-website, where a user receives an email with a link to an infected website. Another way is website-website, when a user finds a link on an uninfected website leading to an infected one. And the last way is browser-website, when a user installs a dangerous browser extension which redirects the user to phishing websites.

Another way is phone phishing: a fake call from the user's bank in which the caller requests the user's personal data. Also using the phone, the victim can be induced to share their information by an SMS.

Nowadays, the number of websites and users is constantly increasing, and with it the creation and spreading of phishing websites which use intelligent tactics to trick the user. Attackers are always trying to find new ways of confusing users with something they have not encountered before, so that the users are deceived easily.

Information in this subsection is drawn from [17].

Figure 4.1: How phishing works [18].

Cross-site scripting: Also known as XSS, cross-site scripting is a technique for compromising a website through unsanitized inputs in the scripts of the website. Using this security loophole, the attacker is able to insert custom Javascript code that can retrieve sensitive information, damage the website or display different content to the user [19].

URL Redirects: This type of phishing is performed so that after visiting the website, the user is redirected to another website. The redirection can occur in Javascript, in HTML meta tags or in plug-ins. The redirection does not have to lead to only one address; it can lead to several different infected websites.

Link manipulation: Most phishing methods involve using a link, found in emails or directly on a website. These links look like they lead to credible sites, for example the user's bank. The name of the link is changed so that it contains typos of the original version, or it uses multiple domains and subdomains that the original does not. The aim is to confuse the victim, who has no idea that he or she is being redirected to an infected page.

Manipulation of the links can be implemented directly in an HTML document or in Javascript. In the latter case, the address of the link changes right after the page is loaded or after the user clicks on the link.

Imitating a trusted entity: The most sophisticated method of phishing is when the attacker acts like a trusted entity with which the victim may have an account. This method can take the form of email communication between the user and the entity, or the victim can come across a website which is almost identical to the original. The user writes his or her private information into the infected site and can then be redirected to the original website without noticing. Nevertheless, there are even subtler ways to deceive a user: for example, the victim can open a link which opens the original website along with a fake form for entering the victim's private data in a new browser window. This type of phishing is also called "tabnabbing".

4.1.2 Other malicious content

Malicious software, or malware for short, includes adware, spyware, viruses, trojan horses, ransomware, rootkits and the above-mentioned phishing. Malware is harmful software which enables the attacker to access the victim's computer.

Different types of malware occur on websites mostly as downloadable binary files, which are difficult to detect without downloading the file. However, there are some malware types, described below, which can appear within the source code of the website.


Spyware: One type of spyware which can be encountered on a website is a keylogger. It is software which records the activity of the user's keyboard and sends it back to the attacker's server. The recording can be targeted so that it captures only <input> tags whose type parameter is set to the value password. By this method an attacker can easily obtain the victim's passwords or credit card numbers.

Ransomware: An "extortionate" method which encrypts all files on the victim's storage and forces the user to pay a ransom, mostly in Bitcoin because its transactions are hard to trace, in exchange for the password to access the storage. Websites are used for spreading ransomware: an attacker infects a website server with this type of software [20] and the owner is forced to pay for decryption.

4.2 Detection of phishing

There are multiple situations in which a website can be infected. In this thesis, the detection of malicious content is viewed from the user's point of view. This means that the key question is what can actually happen to a user when he or she comes across an infected website. From this point of view, we can perform the detection from the source code and from the given URL. If there is an infected file, it is not detected. It is also not detected (from an administrator's point of view) whether the site has been damaged or infected files have been uploaded. This is the reason why this thesis does not include the detection of trojan horses, viruses, and other types of malware.

4.2.1 Cross-site Scripting

There are three types of XSS by which an attacker can inject their Javascript code into a website. In essence, the attacker inserts their own script into inputs on the site or into the URL.

DOM Based: A local type of XSS in which it is possible to pass untreated URL variables into the code. To achieve this attack, the URL variable must be manipulated in the Javascript code, for example by document.write(). The harmful code can then be written directly into the address, for example in the following URL:

http://website.com/page.html?variable=<script>doEvilCode();</script>

Listing 4.1: Example of DOM based XSS.

This code will be added to the DOM by document.write() and executed on the website.

Reflected: Also known as non-persistent XSS, it is almost the same as the DOM based type. The difference is that this problem occurs on unsecured dynamic websites. If a site is, for example, written in PHP and something prints a variable into the website code, an attacker can likewise put unsafe code into the URL.

Persistent: This attack changes a website permanently because of a vulnerability of the database behind the website. This XSS is created by inserting dangerous Javascript code into an input element on the website; for example, it can be part of a text input in a form. This type of XSS is not detected in the main implementation, because it cannot happen without user intervention.

Detection of XSS: Javascript code does not have to appear only in the <script> tag. It can also appear in other tags with DOM event handler attributes (onclick, onmouseover, onerror, etc.). To deceive, this method adds encoded URI (Uniform Resource Identifier) schemes to the src attribute. Base64 encoding can also be used. Other ways to recognize more XSS are available in [21].
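The heuristics above can be sketched in Python. This is an illustrative fragment, not the thesis's actual detection code: the pattern list and the function name looks_like_xss are hypothetical, and a real rule set would cover far more variants (including the encoded URI schemes mentioned above).

```python
import re
from urllib.parse import unquote

# Hypothetical heuristic patterns; a real rule set would be much larger.
XSS_PATTERNS = [
    re.compile(r"<\s*script", re.IGNORECASE),                            # injected <script> tags
    re.compile(r"\bon(click|mouseover|error|load)\s*=", re.IGNORECASE),  # event handler attributes
    re.compile(r"javascript\s*:", re.IGNORECASE),                        # javascript: URI scheme
]

def looks_like_xss(url: str) -> bool:
    """Return True if the query part of a URL contains a suspicious script pattern."""
    query = url.split("?", 1)[1] if "?" in url else ""
    decoded = unquote(query)  # attackers often percent-encode the payload
    return any(p.search(decoded) for p in XSS_PATTERNS)

print(looks_like_xss("http://website.com/page.html?variable=<script>doEvilCode();</script>"))
```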

4.2.2 URL Redirects

The easiest way to redirect in Javascript is to change the window.location.href variable. The same result can be achieved by changing document.location.href. The location.assign() method or window.navigate() loads and displays a new document specified by a URL. There are also ways to change the URL in the history, so that the infected website can be visited by going back in the browser. Many Javascript libraries use the same methods with a different syntax.
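A minimal sketch of scanning Javascript source for the redirect idioms named above; the regular expression and function name are illustrative assumptions, not the implementation's actual rules:

```python
import re

# Redirect idioms mentioned above; illustrative, not exhaustive.
REDIRECT_RE = re.compile(
    r"window\.location\.href\s*=|document\.location\.href\s*=|"
    r"location\.assign\s*\(|window\.navigate\s*\("
)

def find_js_redirects(js_code: str):
    """Return all redirect statements found in a piece of Javascript source."""
    return REDIRECT_RE.findall(js_code)

print(find_js_redirects('window.location.href = "http://evil.com/";'))
```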

In HTML code, redirection can also be done by a meta tag which specifies the URL and the time in seconds after which the window will be redirected, for example:

<meta http-equiv="refresh" content="0; url=http://evil.com/"/>
<meta http-equiv="location" content="0; url=http://evil.com/"/>

Listing 4.2: Meta redirects.

Iframes can also be used for a redirect; an iframe is an HTML tag which allows a website to be displayed within a website. Another means used for a redirect is an unpaired base tag, which loads content from its href attribute.
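The HTML-side redirect vectors above (meta refresh/location, iframes, the base tag) can be collected with Beautiful Soup, which the implementation already uses for parsing. A hypothetical sketch, not the program's actual code:

```python
from bs4 import BeautifulSoup

def find_html_redirects(html: str):
    """Collect meta refresh/location redirects, iframes and base tags from a DOM."""
    soup = BeautifulSoup(html, "html.parser")
    hits = []
    for meta in soup.find_all("meta"):
        if (meta.get("http-equiv") or "").lower() in ("refresh", "location"):
            hits.append(("meta", meta.get("content", "")))
    for iframe in soup.find_all("iframe"):
        hits.append(("iframe", iframe.get("src", "")))
    for base in soup.find_all("base"):
        hits.append(("base", base.get("href", "")))
    return hits

print(find_html_redirects('<meta http-equiv="refresh" content="0; url=http://evil.com/"/>'))
```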

URL redirects can also be done by plug-ins like Flash, Java, PDF, QuickTime or XML, in ways which are specified in greater detail in [22].

4.2.3 Link Manipulation

In the detection of link manipulation, we cannot confirm that a link leads to an infected website, but we can find suspicious behavior.

Nowadays, links rarely use an IP address instead of a domain name, because most legitimate websites do not do so. Therefore, a link containing an IP address is very suspicious.

More than 5 dots in a link is also an indication that needs to be checked. It can mean too many domains, or two addresses in one URL, which may also be phishing.

The URL can contain the "@" symbol. In this type of link, a browser ignores everything to the left of the "@" sign and the link leads to the address on its right; for example, www.paypal.com@fake-paypal.com will be redirected to fake-paypal.com.

Links can also be encoded in hexadecimal format, which the web browser can decode but which hides the decoded URL from the user. This is used for evading blacklists and other filters.
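The heuristics of this subsection can be combined into a simple checker. The following is an illustrative sketch; the thresholds and the function name are assumptions, not the thesis's code:

```python
import re
from urllib.parse import urlsplit

# Matches a bare IPv4 address used as the host part of a link.
IP_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")

def suspicious_link(url: str) -> bool:
    """Apply the heuristics above: '@' trick, IP host, too many dots, hex encoding."""
    parts = urlsplit(url if "//" in url else "http://" + url)
    host = parts.netloc
    if "@" in host:           # the browser keeps only what is right of '@'
        return True
    if IP_RE.match(host):     # raw IP address instead of a domain name
        return True
    if url.count(".") > 5:    # unusually many dots / stacked (sub)domains
        return True
    if "%" in url:            # percent/hex-encoded characters hiding the real target
        return True
    return False

print(suspicious_link("http://www.paypal.com@fake-paypal.com"))
print(suspicious_link("http://example.com/about"))
```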

A link to an infected website can also be detected by using blacklists (like PhishTank) which contain phishing domains.

The information in this subsection is drawn from [23].


4.2.4 Imitating trusted entity

The most imitated entities are the most visited web pages on the Internet from which an attacker can profit (like paypal.com, ebay.com or facebook.com). A list of the most visited websites is available on alexa.com.

This phishing method can be detected by comparing a screenshot of the examined website with screenshots of the most visited web pages. This is a very accurate method of detecting a phishing website [24]. There are problems which have to be solved, like ads, animations and time-dependent multimedia. This can be handled by taking several screenshots at different times after loading the page.

Another method of detecting the same content is taking all the text from the examined website and comparing it to the most visited web pages. If there is a very high percentage of conformity, the website is phishing.
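As a rough illustration of the text-comparison idea, a percentage of conformity between two pages' texts can be computed with difflib from the Python standard library; the actual metric and threshold used in practice are not specified here:

```python
from difflib import SequenceMatcher

def text_similarity(page_text: str, reference_text: str) -> float:
    """Percentage of conformity between a suspect page's text and a known site's text."""
    return SequenceMatcher(None, page_text, reference_text).ratio() * 100

# A near-identical login page text scores very high.
print(text_similarity("Log in to your PayPal account",
                      "Log in to your PayPaI account") > 90)
```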

4.2.5 Detection of other malicious content

Other types of malware like viruses, trojan horses, etc. are hard to detect because they come in the form of a file on the website, but it is possible to reveal them by checking patterns from trusted sources or by an occurrence of the exact same malware source code. Another method of detecting malware is machine learning.

Keylogger: This type of spyware can be revealed by checking whether the Javascript code captures keyboard events and then sends them to a server. The keyboard events are keydown, which fires when a key is pressed, keypress, which fires while a key is held, and keyup, which fires when a key is released. This behavior of a web page stops being innocent when the code also scans the <input> elements and sends their contents to the server.
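The keylogger heuristic above — keyboard events plus password-input access plus sending data out — can be sketched as a simple static scan of the Javascript source. This is an assumption-laden illustration, not the program's actual pattern-based detection:

```python
import re

def looks_like_keylogger(js_code: str) -> bool:
    """Flag code that listens to the keyboard, reads password inputs and sends data out."""
    listens = re.search(r"\b(keydown|keypress|keyup)\b", js_code) is not None
    reads_password = re.search(r"type\s*=?\s*['\"]?password", js_code) is not None
    sends_out = re.search(r"XMLHttpRequest|fetch\s*\(|\.send\s*\(", js_code) is not None
    return listens and reads_password and sends_out

# Hypothetical malicious snippet combining all three traits.
sample = """
document.addEventListener('keydown', function (e) {
    var field = document.querySelector("input[type=password]");
    var xhr = new XMLHttpRequest();
    xhr.send(field.value + e.key);
});
"""
print(looks_like_keylogger(sample))
```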


5 Implementation

The practical part of this bachelor thesis is the implementation of an application for the Linux operating system which detects phishing and other malicious content on websites given as input.

5.1 Design

Figure 5.1: A diagram of the program.

The programming language which I chose is Python, because it has many useful functions for working with URLs or Yara. It is also the language mostly used at CYAN Research & Development s.r.o., so their helpful libraries can be used in the program, and the program can be developed further if the company decides to extend it.


At the beginning, the website needs to be downloaded from the given input. The input is processed in the Main function, which takes a domain and passes it to Wget to get the DOM, and to Google Chrome to get the DOM after Javascript execution. By comparing the two DOMs, it is then possible to see how big an impact Javascript has on the malicious website. Chrome and Wget are run in bash from Python.

Both documents are sent to the Parser class, which transforms each document into a BeautifulSoup object. Then all needed parameters, elements and text which are used further are extracted. These data are sent to the Detector class.

The detection methods are in the Detector class, which covers link manipulation (using a blacklist database), XSS, redirects and Javascript malware detected by Yara patterns. In this class, a dictionary is created for every domain with the results of each detection.

The last part of the program is a class called Evaluation, which takes the dictionary data and, based on the results, decides whether the website is safe, suspicious or dangerous. It then prints the results of the detection with all threats which were found.

5.2 Tools and libraries

This section describes all third-party tools and libraries which are used in the main implementation.

5.2.1 Google Chrome

Before the start of the main implementation, it became apparent that Google Chrome does not return the whole HTML document of a given website. Its headless mode (with the parameter --dump-dom) returns only the <body> element. This is insufficient for the detection of phishing, because Javascript can modify even the elements within the <head> tag of the HTML document.

Google Chrome is based on an open source project called Chromium. They are almost the same project, except that Chromium does not have Google updates, extension restrictions, crash and error reporting, the security sandbox or Adobe Flash support, and it supports only a few codecs.

In the Chromium source code file headless_shell.cc, there is a FetchDom() function which implements the --dump-dom command-line parameter. It contained a bug which printed only the <body> of the document instead of the whole DOM. The solution to this problem was to build our own Chromium version with this bug fixed.

In this implementation, a fixed version was built and then packed into an RPM (Red Hat Package Manager) package for Linux to prepare it for use on other devices. The bug was fixed in the Chromium project on 2017-08-22. The building of Chromium was unsuccessful on Fedora 22, but on Ubuntu 16.10 it worked flawlessly.

In the main implementation, Chromium is used for downloading the DOM of the website after executing its Javascript. The Python program executes it as a bash command. This command has the extra parameters --timeout and --virtual-time-budget to set a timeout in milliseconds during which the browser waits for a response from the web server.

bash_command = ('google-chrome-stable --headless --timeout=60000 '
                '--virtual-time-budget=60000 --disable-gpu --dump-dom '
                'http://' + domain)

try:
    output_dom = subprocess.check_output(['bash', '-c', bash_command],
                                         timeout=120)
except subprocess.CalledProcessError:
    cprint("WARNING: Something has gone wrong with executing Chrome, "
           "domain: " + domain, 'red')
    continue
except subprocess.TimeoutExpired:
    cprint("WARNING: Something has gone wrong with executing Chrome "
           "(timeout expired), domain: " + domain, 'red')
    continue

Listing 5.1: Use of Chromium in the implementation.


5.2.2 Wget

Wget is a program [25] for downloading files via the HTTP, HTTPS and FTP protocols. It supports recursive downloading and downloading via a proxy, and it is part of the base system of some Linux distributions.

It is used in the same way as Chromium, but for downloading a website before its Javascript is executed. In the program, Wget is run with the -t parameter giving the number of download retries, with a connection timeout in seconds set by --connect-timeout, and with the -qO- parameter to suppress Wget's log output and print the document to standard output.

wget -qO- -t 1 --connect-timeout=5 http://example.com

Listing 5.2: Wget command.

5.2.3 Beautiful soup

A Python library [26] for processing HTML and XML documents. Beautiful Soup is able to parse a document, find any part (elements or attributes) of the code and then work with it.

It includes many useful functions for working with a document, such as getting all the text or pretty-printing the document for easier reading. This makes all operations with HTML much easier.

from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp)

links = soup.find_all('a')

Listing 5.3: Finding all the links via Beautiful Soup.

5.2.4 PyLibs

PyLibs is a proprietary library developed by CYAN Research & Development s.r.o. for working with URLs. It allows processing the subdomains, host, port, path or query of a URL.
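Since PyLibs is proprietary, its API cannot be shown here; for illustration, the same URL decomposition can be done with the standard-library urllib.parse (the naive subdomain split below is an assumption that ignores public-suffix rules):

```python
from urllib.parse import urlsplit

parts = urlsplit("http://login.bank.example.com:8080/account?id=42")
print(parts.hostname)  # login.bank.example.com
print(parts.port)      # 8080
print(parts.path)      # /account
print(parts.query)     # id=42

# Naive subdomain extraction: drop the registrable domain's last two labels.
subdomains = parts.hostname.split(".")[:-2]
print(subdomains)      # ['login', 'bank']
```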


5.2.5 Yara patterns

A tool [27] for researchers of malicious content which classifies and identifies malware samples by patterns in .yar files. A rule has three parts. The first one is meta, which holds information about the rule. The second is strings, which are the searched parts of the malware code, and the last one is the condition, which determines the rule over the strings.

The implementation uses open source Yara patterns for searching for malware content on websites. There is a file with simplified patterns which contain just the strings part and their condition.
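Matching such a simplified pattern (strings plus an "all of them" condition) against a downloaded DOM amounts to substring checks. A minimal sketch with hypothetical pattern strings, not the actual rule file:

```python
def matches_rule(document: str, strings: list) -> bool:
    """The 'all of them' condition: every pattern string must occur in the document."""
    return all(s in document for s in strings)

# Hypothetical simplified pattern strings; real ones come from the patterns file.
rule_strings = ["eval(unescape(", "document.write("]
dom = "<script>eval(unescape('%64...')); document.write(x);</script>"
print(matches_rule(dom, rule_strings))
```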

rule silent_banker : banker
{
    meta:
        description = "Example of pattern"
        in_the_wild = false
        thread_level = 5

    strings:
        $s1 = {8D 4D B0 2B C1 83 C0 27 99 6A 4E 59 F7 F9}
        $s2 = "UVODIHLNWPEJXKCBGMT"
        $s3 = {6A 40 68 00 30 00 00 6A 14 8D 91}

    condition:
        all of them
}

Listing 5.4: Example of Yara pattern.

5.3 Input

The program is started by the following command, with the parameters described below:

python3 main.py -d phishing.com

Listing 5.5: The command for starting the program.

Parameters: There are only two parameters to choose from. To check just one domain, the user writes -d domain.com; to check a whole file of domains, -f file_path. The domains have to be written without a protocol.

5.4 Output

For each website, there is a clear overview of the types of detection which occur on the website. The output also states the final status of the website, which says whether the domain is safe, suspicious or malicious. A website is marked as safe only when the program does not find anything. Websites are considered suspicious when they have a domain in the form of an IP address, contain links with IP addresses, contain manipulated links or have a URL redirect outside the domain. A website is dangerous/malicious if the domain is in a blacklist, the website contains blacklisted links, the domain carries XSS or the website contains XSS links, or at least one malware pattern is found on the website.
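The classification rules above can be sketched as follows; the result-dictionary keys are hypothetical names, not the actual keys used by the Detector class:

```python
def evaluate(results: dict) -> str:
    """Map a domain's detection results to a final status (keys are hypothetical)."""
    dangerous = (results.get("domain_blacklisted") or results.get("blacklisted_links")
                 or results.get("xss") or results.get("xss_links")
                 or results.get("malware_patterns"))
    suspicious = (results.get("ip_domain") or results.get("ip_links")
                  or results.get("manipulated_links") or results.get("external_redirect"))
    if dangerous:
        return "dangerous"
    if suspicious:
        return "suspicious"
    return "safe"

print(evaluate({"xss": True}))
print(evaluate({"manipulated_links": True}))
print(evaluate({}))
```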

Figure 5.2: Output of one website.

There is also a summary in the output. It is displayed after the processing of each website for a better overview of all processed websites. The final output is the same summary together with a list of the dangerous and suspicious websites.


6 Evaluation of results

In this chapter, I describe the optimization of the problematic parts of the program. The problems were fixed after the first few executions. Then the running times of the individual parts of the program and its detection were measured. The final detection with results was executed after an update and improvement of the deficiencies of the program.

From the overview of times and from the final results, it becomes apparent that some types of detection in the program are time consuming while detecting only a small number of websites. It also becomes apparent whether it is necessary to have a headless browser interpreting Javascript, or whether the DOM from a plain downloading tool gives the same results without a Javascript interpreter.

6.1 Optimization

6.1.1 Parallelism

At the first execution of the program, detection took a long time because of waiting for Wget and Google Chrome. The downloading tools had a 15-second timeout for leaving a website whose server did not respond. The detection was executed serially; therefore, a simple solution was to use more processes at once, with one process testing one website. How many processes can run depends on the performance of the computer on which the program is executed. The biggest constraint is RAM capacity: the highest recorded RAM usage by Chrome was 252 MB.
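The parallel dispatch can be sketched with a worker pool. For a self-contained example a thread pool is used here, while the program itself runs separate processes; check_domain is a placeholder for the real download-and-detect step:

```python
from multiprocessing.pool import ThreadPool

def check_domain(domain: str) -> tuple:
    # Placeholder for the real per-domain work (Wget/Chrome download + detection).
    return (domain, "safe")

domains = ["example.com", "example.org", "example.net"]

# Pool size would be bounded by RAM in practice (Chrome peaked at ~252 MB).
with ThreadPool(4) as pool:
    results = pool.map(check_domain, domains)
print(results)
```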

Figure 6.1: Usage of RAM by Chrome.


6.1.2 Database

Another problem which was hard to overlook from the first executions was the database. Every link on a website is searched in a blacklist database, and a website can contain hundreds of links; therefore, the database needed better indexing, which was achieved by adding a unique constraint on the domain column in the blacklist table.
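The indexing fix, together with querying many links in one statement, can be illustrated with sqlite3; the actual database engine and schema of the implementation are not specified here, so the table and column names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blacklist (domain TEXT)")
# The unique constraint doubles as an index, making single-domain lookups fast.
conn.execute("CREATE UNIQUE INDEX idx_blacklist_domain ON blacklist(domain)")
conn.executemany("INSERT INTO blacklist VALUES (?)",
                 [("evil.com",), ("phish.net",)])

# Check all links of a page in one round trip instead of one query per link.
links = ["evil.com", "example.com", "phish.net"]
placeholders = ",".join("?" * len(links))
rows = conn.execute(
    "SELECT domain FROM blacklist WHERE domain IN (%s)" % placeholders,
    links).fetchall()
print(sorted(r[0] for r in rows))
```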

6.2 Execution times

The execution times were measured over 100,000 successfully downloaded and detected domains. The statistics below show which part of the program is the slowest and, eventually, how to reduce the execution time. The program was measured on a computer with the following performance:

Figure 6.2: The performance of testing PC.

6.2.1 Conclusion

The average time of detection per website is six seconds. The longest time is required for the download of a website via Chrome. This delay is hard to avoid (also with Wget) because the main cost lies in the network and in downloading the website from its server with respect to the timeout. Chrome takes longer than Wget because of the Javascript interpretation and execution.

Some websites have hundreds of links and some only a few. For faster execution of database commands, it is better to ask whether all the links of a page are in the blacklist at once, instead of querying each link separately.

The chart below shows that, in comparison with the above, the other parts of the program are not time consuming.


Figure 6.3: Average execution times per page.

Figure 6.4: Average execution times by percentage.

6.3 Results

The statistics in Figure 6.6 are taken from the results of one million processed malware domains. The program ran for 8 days and 5 hours and detected 44 percent of the malicious websites. The program did not reveal all of the phishing and other malware domains because the detection was aimed at the DOM of the website before and after Javascript interpretation and at the domain name.

Most of the websites were detected by the blacklist or by links leading to a blacklisted website. Many domains contained XSS which was revealed in the Javascript code. There are also non-negligible numbers of found malware patterns and manipulated links.

Figure 6.5: Ratio of exposed malware per million domains.

Figure 6.6: Average execution times by percentage.

The number of links that carry XSS (in the query part of the URL) is small because attackers use it to attack unsecured dynamic websites. The low count can be caused by the fact that most frameworks for creating web applications are secured against XSS by default. No redirect was detected because the program detects only HTML redirect methods and the Javascript constructs which can redirect the website out of the domain. Redirects can also be set up in web server config files or in the .htaccess of each directory on the server.

6.3.1 Comparison with Google Safe Browsing

Safe Browsing by Google [28] provides protection against malicious websites to users of Google search, Android, Google Chrome, Firefox, Safari and more. In these browsers, users are warned when they visit a malicious website.

Figure 6.7: Results of detection in one million domains.

The tool also has an API for checking whether given websites are dangerous. It was given the same input of a million domains as the main implementation. Google Safe Browsing marked 26 percent of the websites as malware positive, which is 18 percentage points less than the main implementation.

6.3.2 Effect of Javascript interpreter

One of the most important questions of this thesis is whether a Javascript interpreter has any effect on the number of detected websites, compared to just analyzing the downloaded DOM. Hypothetically, Javascript can completely change the whole content of the website, adding malware content or even obfuscating malicious code already inserted into the website.

The tool used for downloading websites without Javascript interpretation was Wget. The results compared with Chrome are shown in Figure 6.8.

Using the DOMs from Wget, the program detected 28 percent fewer dangerous web pages than with the DOMs returned by Chrome. With Javascript interpretation, the program can detect more malware websites, mostly because of the detection of XSS in Javascript code and more found malware patterns. Without Javascript interpretation, however, the program detected more manipulated links, because executed Javascript code could otherwise obfuscate them.


Figure 6.8: Comparison of results of Chrome and Wget.

6.4 Further work

For further work on the implementation, more features of headless browsers can be used. Headless browsers can act like normal web browsers, so detecting the HTTP headers and HTTP requests of the websites is possible. New methods can be found to detect phishing, or more researched Yara patterns can be found to match the headers or requests.

Another interesting use of headless browsers is the possibility of taking screenshots of websites. This way, the detection of the imitation of trustworthy entities/websites can be achieved. There are problems which need to be dealt with, like animations and ads, which have an effect on the comparison of the imitated website with the original.


7 Conclusion

This bachelor thesis deals with creating a system for the detection of websites with phishing and other malicious content with respect to Javascript interpretation. The aim was to find the best solution for interpreting Javascript and to develop a detection system which works with the chosen tool. First, an overview of automated web testing tools was prepared. The best features were found in headless browsers; therefore, a survey and comparison of them was conducted. The best solution in the comparison was Google Chrome's headless mode, because it is able to run without a GUI and it has a flawless Javascript interpreter.

This thesis examined the issue of malicious website content and the phishing methods which can be used on the Internet. There are many ways to detect dangerous content, and here, detection based on the DOM information of the website was chosen.

The main implementation of the thesis was designed around downloading a web page after executing its Javascript code. After a few executions, the problematic and time-consuming parts were optimized. Then, as the main result, the program detected almost half of a random input of one million malicious domains. The results also showed that the program without a Javascript interpreter detected fewer web pages than with it. During the implementation, several issues were solved; for example, Google Chrome was not returning the whole DOM of the website, so this issue was fixed by building our own Chromium.

This bachelor thesis was a real test for me, because no cooperation was sought on the project from beginning to end, except advice from the advisor and consultant. Thanks to this, the thesis provided me with a great amount of new knowledge. It specifically expanded my knowledge of how malicious content works on websites, how phishing attacks work and in which ways they can be detected. The practical part also provided the ability to design and program an application as a whole, and improved my skills in the Python programming language.
