Implementation of a Motion Detection System

WASET Nov. 2013, Malaga · Large Scale Systems

Architectures of Large Scale Systems (LSS)

Prof. Arne Koschel, Irina Astrova, Elena Deutschkämer, Jacob Ester and Johannes Feldmann


Agenda
• Introduction and definition
• LSS hard- and software
• Components
  – Distributed file systems
  – Databases
  – Hadoop / MapReduce
  – Caching
• Practical example
  – Facebook
• Conclusion


INTRODUCTION & DEFINITION


What does "large scale" mean?
• No single exact definition
• Criteria may be
  – amount of data processed
  – number of hardware elements
  – number of people involved
  – number of system purposes and processes
• Problems
  – performance, reliability, complexity, development process


Differences to traditional systems

Characteristic        Traditional IT system                          Large scale system
Governance            Singular dominant influence                    Multiple, conflicting influences
Duration of life      Defined at design time                         Infinite
Flow of information   Well-understood internal flow, known sources   Changing flow of information, new sources
Complexity            Optimized                                      Highly complex, not optimized
Elements              Services, components                           Systems, services


Differences to traditional systems
[Figure: a traditional software system compared with a large scale system, which is a system of systems]


Scalability
• Vertical scalability (scale up)
  – Replace the system with a more powerful system
• Horizontal scalability (scale out)
  – Add extra server(s) to the system
  – Of interest for large scale systems
[Figure: servers illustrating scale up and scale out]
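To make the scale-out idea concrete, here is a minimal sketch, assuming a simple round-robin dispatcher. The class `ServerPool` and the server names are hypothetical and only illustrate that capacity grows by adding machines to a pool rather than by replacing one machine.

```python
class ServerPool:
    """Hypothetical pool that spreads requests over its servers round-robin."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._next = 0  # running request counter

    def add_server(self, name):
        # Scale out: add another machine instead of replacing an existing one.
        self.servers.append(name)

    def route(self, request):
        # Pick the next server in turn; the request itself is not inspected.
        server = self.servers[self._next % len(self.servers)]
        self._next += 1
        return server


pool = ServerPool(["server-1", "server-2"])
first_routes = [pool.route(i) for i in range(4)]
pool.add_server("server-3")
later_routes = [pool.route(i) for i in range(3)]
```

Real load balancers usually also consider server health and current load; round-robin is only the simplest policy that shows the principle.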


HARD- AND SOFTWARE


LSS Hard- and software
• Hardware
  – Research in specialized hardware
    • Open Compute Project
      – Instructions and specifications for constructing very efficient servers
  – On the market
    • Scalable servers by Intel and HP
• Software
  – Various software approaches for LSS (especially for web-based LSS)
    • Frameworks and algorithms designed for LSS
    • Many different databases and file systems
    • Caching mechanisms


COMPONENTS: Distributed File Systems


Distributed File Systems
• Provide data access to many clients
• Use a network and an access protocol
• May offer mechanisms for replication and fault tolerance
• Highly optimized in LSS
  – Google File System, Amazon S3, Facebook Haystack, etc.


Example: Google File System (GFS)
• Assumptions made
  – Commodity hardware that is expected to fail
  – Huge files (multi-GB)
  – Workload consists of ...
    • sequential or random reads
    • mostly sequential writes (streams), no random writes
    • practically no overwriting
  – Bandwidth is more important than low latency


GFS Architecture


GFS features and benefits
• Separation of data flow and control flow
• Metadata (addressing) is kept in memory on the master
• Large chunks (64 MB)
• Relaxed consistency model
  – Atomic writes and appends, garbage collection
• High availability
  – Balancing, fast recovery, master and chunk replication, snapshots
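The fixed 64 MB chunk size makes addressing simple arithmetic: a byte offset in a file maps to a chunk index and an offset inside that chunk. The sketch below shows only this calculation; the function names are illustrative and not part of any GFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size named on the slide


def locate(offset):
    """Translate a byte offset in a file to (chunk index, offset inside the chunk)."""
    return divmod(offset, CHUNK_SIZE)


def chunks_needed(file_size):
    """Number of chunks required to hold file_size bytes (ceiling division)."""
    return -(-file_size // CHUNK_SIZE)


# A read at byte offset 200 MB falls into chunk 3, 8 MB into that chunk.
chunk_index, chunk_offset = locate(200 * 1024 * 1024)

# A 3 GB file occupies 48 chunks, so the master only tracks 48 entries for it.
total_chunks = chunks_needed(3 * 1024 ** 3)
```

Large chunks keep the number of metadata entries per file small, which is what allows the master to hold all addressing data in memory.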


COMPONENTS: Databases


Databases
• Main types (core NoSQL)
  – Column store
    • Any key to any key-value pairs (column family)
  – Document store
    • Structured data collections like JSON
  – Key/value store
    • Schema of key and value (strings, hashes, sets, lists)
  – Graph DBs


Example: BigTable
• Column store
• Column families = sets of column keys
• Tablet = row range
• 3 main components
  – Library, master server, tablet server
• Scalability
  – Tablet servers can be added dynamically
• Proprietary database by Google
  – Open-source solution: HBase
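The data model above can be sketched with plain dictionaries: each row key maps to columns named "family:qualifier", and each tablet covers a contiguous range of row keys. All row keys, server names and range boundaries below are made up for illustration; this is not the BigTable or HBase API.

```python
from bisect import bisect_left

# Row key -> {"family:qualifier": value}. Illustrative data only.
table = {
    "com.example/index": {"contents:html": "<html>...</html>", "anchor:other.org": "Example"},
    "org.sample/home": {"contents:html": "<html>...</html>"},
}

# Each tablet serves a contiguous row range. tablet_end_keys holds the last
# row key (inclusive) of every tablet in sorted order.
tablet_end_keys = ["g", "p", "\uffff"]
tablet_servers = ["tablet-server-1", "tablet-server-2", "tablet-server-3"]


def tablet_server_for(row_key):
    """Find the tablet server whose row range contains row_key."""
    return tablet_servers[bisect_left(tablet_end_keys, row_key)]


def read_family(row_key, family):
    """Return all columns of one column family for a row."""
    row = table.get(row_key, {})
    return {col: val for col, val in row.items() if col.split(":", 1)[0] == family}


server = tablet_server_for("org.sample/home")
anchors = read_family("com.example/index", "anchor")
```

Because rows are partitioned by sorted key range, adding a tablet server only requires reassigning some ranges, which is how the design scales out.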


COMPONENTS: Hadoop / MapReduce


Hadoop
• Free, Java-based framework for LSS
• Inspired by Google's MapReduce and Google File System (GFS)
• Main contents of the Hadoop framework
  – Hadoop Common
    • Provides access to the supported file systems
      – HDFS, Amazon S3, CloudStore, FTP file system, read-only HTTP and HTTPS file systems
    • Necessary JAR files and scripts
  – Hadoop Distributed File System (HDFS)
    • Distributed, scalable, portable file system
    • By default, data is stored on three nodes: two on the same rack and one on a different rack
  – Hadoop MapReduce
    • Consists of one Job Tracker (receives the MapReduce jobs) and many Task Tracker nodes
      – The Job Tracker knows which node contains the data and which Task Tracker is nearby (keeps the work as close to the data as possible)
    • Uses Map() and Reduce() functions (see the following slides)
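The default placement rule (three copies, two on the same rack and one on a different rack) can be sketched as follows. This is a simplified illustration, not Hadoop's actual placement code; the function `place_replicas` and the rack and node names are hypothetical.

```python
def place_replicas(racks):
    """Pick three nodes: two on one rack and one on a different rack.

    racks maps a rack name to the list of node names on that rack.
    """
    rack_names = sorted(racks)
    # A rack that can hold two of the three copies.
    same_rack = next((r for r in rack_names if len(racks[r]) >= 2), None)
    # Any other non-empty rack takes the remaining copy.
    other_rack = next((r for r in rack_names if r != same_rack and racks[r]), None)
    if same_rack is None or other_rack is None:
        raise ValueError("need one rack with two nodes and a second non-empty rack")
    return racks[same_rack][:2] + racks[other_rack][:1]


cluster = {"rack-a": ["a1", "a2", "a3"], "rack-b": ["b1"]}
replicas = place_replicas(cluster)
```

Keeping two copies on one rack limits cross-rack traffic during writes, while the copy on a second rack survives the loss of a whole rack.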


MapReduce algorithm
• Uses a list of key-value pairs to calculate a new list of key-value pairs
• Map function
  – The job tracker partitions the input into smaller sub-problems and distributes them to the task trackers
  – Writes the results to an intermediate storage
• Reduce function
  – Gets the results from the intermediate storage
  – Calculates the final result of the main problem

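\[
\text{MapReduce}\colon (K \times V)^* \to (L \times W)^*, \qquad
[(k_1, v_1), \ldots, (k_n, v_n)] \mapsto [(l_1, w_1), \ldots, (l_m, w_m)]
\]

\[
\text{Map}\colon K \times V \to (L \times W)^*, \qquad
(k, v) \mapsto [(l_1, x_1), \ldots, (l_{r_k}, x_{r_k})]
\]

\[
\text{Reduce}\colon L \times W^* \to W^*, \qquad
(l, [y_1, \ldots, y_{s_l}]) \mapsto [w_1, \ldots, w_{m_l}]
\]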
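A small single-process sketch of the algorithm, using word counting as the example problem. The driver `map_reduce` stands in for the job tracker and the intermediate storage; in Hadoop these steps run distributed across task trackers. All function names here are illustrative.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: (document name, text) -> list of (word, 1) pairs."""
    return [(word.lower(), 1) for word in value.split()]


def reduce_fn(key, values):
    """Reduce: (word, [1, 1, ...]) -> list holding the total count."""
    return [sum(values)]


def map_reduce(pairs, map_fn, reduce_fn):
    # Map stage: apply map_fn to every input pair and group the
    # intermediate values by their new key (the "intermediate storage").
    intermediate = defaultdict(list)
    for key, value in pairs:
        for new_key, new_value in map_fn(key, value):
            intermediate[new_key].append(new_value)

    # Reduce stage: one reduce_fn call per intermediate key.
    result = []
    for new_key in sorted(intermediate):
        for out_value in reduce_fn(new_key, intermediate[new_key]):
            result.append((new_key, out_value))
    return result


documents = [("doc1", "a b a"), ("doc2", "b c")]
counts = map_reduce(documents, map_fn, reduce_fn)
```

Because each map call sees only one input pair and each reduce call sees only one key, both stages can be spread over many workers without coordination beyond the grouping step.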


Map-Reduce
[Figure: data flow from input data (Split0 to Split6) through the map stage (workers running map()), temporary storage (files) and the reduce stage (workers running reduce()) to the result (files)]


HDFS
• HDFS (Hadoop Distributed File System)
  – Consists of one name node and many data nodes
    • The name node stores the metadata of the HDFS
    • The data nodes store the data
  – The cluster of data nodes forms the HDFS cluster
  – Uses the TCP/IP layer and RPC to communicate
  – Replicates data across multiple hosts


HDFS Architecture
[Figure: one master and several slaves, each spanning a MapReduce layer and an HDFS layer. The master runs the job tracker and the name node alongside a task tracker and a data node; every slave runs a task tracker and a data node.]


COMPONENTS: Caching


Caching
• Store frequently accessed data in fast memory near the target location
• Main goal in web applications: avoid database access
• Highly efficient when results are allowed to be slightly "out of date"
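One common way to exploit "slightly out of date" results is a cache with a time-to-live (TTL): a value is served from memory until it is older than the TTL and only then reloaded from the database. The sketch below is a minimal in-process illustration; `TTLCache`, `load_user` and the 60-second TTL are assumptions for the example.

```python
import time


class TTLCache:
    """Serve cached values until they are older than ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the cache can be tested
        self._store = {}            # key -> (value, time stored)

    def get_or_load(self, key, load):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]         # cache hit: no database access
        value = load(key)           # cache miss or expired: go to the database
        self._store[key] = (value, now)
        return value


database_calls = []


def load_user(key):
    # Stand-in for a slow database query.
    database_calls.append(key)
    return key.upper()


fake_time = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_time[0])
first = cache.get_or_load("ada", load_user)
second = cache.get_or_load("ada", load_user)   # served from the cache
fake_time[0] = 61.0
third = cache.get_or_load("ada", load_user)    # entry expired, reloaded
```

The longer the tolerated staleness, the larger the TTL can be and the fewer requests reach the database.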


Memcached
• "Short-term memory for applications"
• Main idea: provide one single cache used by several web servers instead of many independent per-server caches


Memcached – key features
• Free and open source (http://www.memcached.org/)
• Implemented as a hash table, distributed across multiple machines
• Client/server architecture
• APIs and integration available for many languages
• Separates server and memory units
• Limitations
  – No permanent persistence
  – No complex queries
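The "hash table distributed across multiple machines" can be illustrated by a client that hashes each key to choose a server. In this sketch plain dictionaries stand in for the memcached servers, and `ShardedCache` and the server names are hypothetical; it does not use a real memcached client library.

```python
import zlib


class ShardedCache:
    """Client-side key distribution over several cache servers."""

    def __init__(self, server_names):
        self.names = sorted(server_names)
        # One in-process dict per server stands in for a remote memcached instance.
        self.servers = {name: {} for name in self.names}

    def _server_for(self, key):
        # The same key always hashes to the same server, so every web
        # server using this scheme sees one shared cache.
        index = zlib.crc32(key.encode("utf-8")) % len(self.names)
        return self.names[index]

    def set(self, key, value):
        self.servers[self._server_for(key)][key] = value

    def get(self, key):
        return self.servers[self._server_for(key)].get(key)


cache = ShardedCache(["cache-1", "cache-2", "cache-3"])
cache.set("user:42", "Ada")
value = cache.get("user:42")
```

With simple modulo hashing, adding or removing a server remaps most keys; many client libraries therefore use consistent hashing instead, which this sketch leaves out.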


PRACTICAL EXAMPLES: Facebook


Facebook
• Facts
  – 200 billion page calls per month
  – 15,000 websites use Facebook Connect
  – 9.5% of worldwide internet traffic
  – 9 data centers
• Scales across multiple data centers
  – 60,000 servers (June 2010)
  – Based on LAMP (Linux, Apache, MySQL, PHP)
  – One of the world's biggest MySQL clusters
  – Some PHP functions are converted into faster C++
  – Main framework uses RPCs for additional services and extensions
    • Services use Hadoop, Cassandra, Hive, Scribe, …


But ...
• LAMP is not perfect
  – PHP is stateless
  – Data is remote
• Services
  – Store code closer to data
  – Compiled environment is more efficient
• Facebook Messaging (chat, status updates, messages, e-mail, SMS)
  – Some parts are written in Erlang
  – Uses Hadoop/HDFS
    • More than 3,200 jobs/day
    • 800,000 MapReduce tasks/day
    • Scans 55 TB of data per day
    • 15 TB of compressed output data written to HDFS

High-level architecture
[Figure: LAMP + Services. Shows PHP, Memcached and MySQL together with the services AdServer, Search, Network, Newsfeed, Blogfeed, CSSParser, Mobile, …, as well as Thrift and Scribe.]

• PHP
  – Good library support for web applications
  – Active developer community
  – Good for rapid iteration
• Memcached
  – Used to reduce database load
  – More than 25 TB in memory cache
  – Uses UDP
    • Reduces overhead from TCP connection buffers
    • Application-level flow control, sequencing, …
• MySQL
  – Almost all data is identified by GUID
  – Load balancer at physical node level
    • More than 500 physical DB nodes
  – Extended query engine for cross-datacenter replication


CONCLUSION


Conclusion
• Large scale systems …
  – exceed traditional applications in various dimensions
  – are systems of systems
  – combine various technologies into unique, adapted solutions
  – choice of technologies depends on requirements
• Architecture enables dynamic growth of the entire system
