Systems for Big Data

Srihari Srinivasan, ThoughtWorks

Platform for software development

Web 2.x

• It's cloud computing

• It's machine learning

• It's distributed and service oriented

Coming up

• Going distributed - When and Why?

• The landscape of Big Data systems - What are the apps?

When do we go distributed?

• A truly distributed design is usually a second- or third-generation solution

• Amazon started off as a simple web application talking to a database 15+ years ago

• Twitter started out as a simple Ruby on Rails application talking to MySQL in 2006

When do we go distributed?

• The application's or organization's complexity grows

• Data or request volume becomes too large for a single machine

• Your software needs to be deployed in multiple data centers

• Your teams deliver software in the form of services

Courtesy: Jeff Dean's LADIS 2009 keynote

Designing systems for scale

• Many production grade systems have been built and written about in recent times

• Need for a taxonomy that describes the big data systems landscape

A taxonomy for distributed systems

• Distributed Storage Systems

• Distributed Applications

• Monitoring & Management

• Personalization & Recommendation

Distributed Storage

• Distributed Filesystems

• Distributed/Parallel Databases

• Messaging and Notification engines

Distributed Filesystems

• Allow clients on multiple networked hosts to access shared files

• Clients don't access the underlying block storage directly; they go through a network protocol

• Modern DFSs are good at providing replication & fault tolerance
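
To make the replication and fault tolerance point concrete, here is a minimal sketch in Python. It is not how any particular DFS works: the host names, the `place_replicas` and `readable` helpers, and the replication factor of 3 are all assumptions made for illustration. It shows only the basic idea that a block copied to several distinct hosts stays readable as long as one of those hosts is alive.

```python
import hashlib

def place_replicas(block_id, hosts, replication=3):
    """Pick `replication` distinct hosts for a block (hypothetical placement rule).

    The starting host is derived from a hash of the block id, and the
    remaining replicas go to the next hosts in the list, wrapping around.
    """
    if replication > len(hosts):
        raise ValueError("not enough hosts for the requested replication factor")
    start = int(hashlib.sha256(block_id.encode()).hexdigest(), 16) % len(hosts)
    return [hosts[(start + i) % len(hosts)] for i in range(replication)]

def readable(block_id, hosts, failed, replication=3):
    """A block stays readable while at least one of its replica hosts is alive."""
    replicas = place_replicas(block_id, hosts, replication)
    return any(host not in failed for host in replicas)

# Hypothetical cluster of five hosts.
hosts = ["host-a", "host-b", "host-c", "host-d", "host-e"]
replicas = place_replicas("file.log#0", hosts)

# Losing two of the three replica hosts still leaves one readable copy.
survives_two_failures = readable("file.log#0", hosts, failed=set(replicas[:2]))

# Losing all three replica hosts makes the block unavailable.
survives_three_failures = readable("file.log#0", hosts, failed=set(replicas))
```

Real systems add much more on top of this, such as re-replicating blocks after a failure is detected, which the sketch leaves out.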

Distributed Databases

• Database engines that store and retrieve data across multiple machines in a network, often referred to as NoSQL databases

• Examples: Apache Hive, Amazon Dynamo, HadoopDB, Facebook's Cassandra, Google Bigtable

• They tend to be non-relational, distributed, open-source and horizontally scalable

• They are typically schema-free, offer easy support for replication, and are eventually consistent (BASE over ACID)
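
Eventual consistency can be illustrated with a small Python sketch. The `Replica` class below is hypothetical, and last-write-wins by timestamp is only one possible conflict-resolution rule, chosen here for brevity. The point is that replicas may briefly disagree after a write, and converge once they exchange state.

```python
class Replica:
    """A toy key-value replica that resolves conflicts by last-write-wins."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        # Keep the incoming value only if it is newer than what we hold.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other):
        # Pull every entry from the other replica; older entries are ignored.
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)

a, b = Replica(), Replica()
a.write("cart", ["book"], timestamp=1)
b.write("cart", ["book", "pen"], timestamp=2)

stale = a.read("cart")  # replica a has not yet seen the newer write

# After exchanging state in both directions, the replicas agree.
a.sync_from(b)
b.sync_from(a)
```

Between the second write and the sync, a client reading from `a` sees an older value. That window is the trade-off BASE systems accept in exchange for availability and easy replication.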

Distributed Apps

• Data parallel programming frameworks

• Graph processing engines

• P2P content delivery

• Multi-tenant SaaS applications

• Content delivery networks
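
As a rough illustration of the data parallel programming model mentioned in the list above, the following Python sketch counts words in a map-then-reduce style. It runs on one machine with a thread pool standing in for a cluster; the function names and the sample partitions are assumptions made for the example, not the API of any specific framework.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_partition(lines):
    """Map step: count words within a single partition of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def word_count(partitions):
    """Run the map step on each partition in parallel, then merge the results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_partition, partitions))
    # Reduce step: add the per-partition counters together.
    return reduce(lambda left, right: left + right, partials, Counter())

# Hypothetical input, already split into two partitions.
partitions = [
    ["big data systems", "distributed systems"],
    ["big distributed databases"],
]
totals = word_count(partitions)
```

Because each partition is processed independently, the map step can be spread over as many workers as there are partitions. Only the final merge needs to see all the partial results.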

Monitoring and Management

• Distributed debuggers, tracers and profiling applications

• Monitoring systems

Personalization & Recommendation

• Recommendation engines

• Sentiment analyzers

• Personalized news & content discovery systems

Visit www.systemswemake.com

Follow on Twitter @systems_we_make