Graph databases

35
Graph Databases: Principles NoSQL Databases

Transcript of Graph databases

Graph Databases: Principles

NoSQL Databases

Agenda

● Graph Databases: Mission, Data, Example● A Bit of Graph Theory

o Graph Representationso Improving Data Locality (efficient storage)o Graph Partitioning and Traversal Algorithmo Types of Queries

● Graph Databases Neo4j: Basics

Graph Databases: Example

source: Sadalage & Fowler: NoSQL Distilled, 2012

Graph Databases: Mission

● To store entities and relationships between themo Nodes are instances of objectso Nodes have properties, e.g., nameo Edges connect nodes and have directional significanceo Edges have types e.g., likes, friend, …

● Nodes are organized by relationshipso Allow to find interesting patternso example: Get all nodes that are “employee” of “Big Company”

and that “likes” “NoSQL Distilled”

Graph Databases: Representatives

Ranked list: http://db-engines.com/en/ranking/graph+dbms

Basic Characteristics

● Different types of relationships between nodeso To represent relationships between domain entitieso Or to model any kind of secondary relationships

Category, path, time-trees..● No limit to the number and kind of relationships

● Relationships have: type, start node, end node, own propertieso e.g., “since when” did they become friends

Relationship Properties: Example

source: Sadalage & Fowler: NoSQL Distilled, 2012

Example: Neo4j

Node martin = graphDb.createNode();martin.setProperty("name", "Martin");Node pramod = graphDb.createNode();pramod.setProperty("name", "Pramod");

Relationship m2p = martin.createRelationshipTo(pramod, FRIEND);Relationship p2m = pramod.createRelationshipTo(martin, FRIEND);

m2p.setProperty("quality", "a good one");m2p.setProperty("quality", "a good one");● Undirected edge:

o Relationship between the nodes in both directionso INCOMING and OUTGOING relationships from a node

A Bit of a Theory

● Data: a set of entities and their relationshipso => we need to efficiently represent graphs

● Basic operations: finding the neighbours of a node, checking if two nodes are connected by an edge, updating the graph structure, …

o => we need efficient graph operations● graph G = (V, E) is commonly modelled as

o set of nodes (vertices) Vo set of edges Eo n = |V|, m = |E|

● Which data structure to use?

Data Structure: Adjacency Matrix

● Two-dimensional array A of n n⨉ Boolean valueso Indexes of the array = node identifiers of the grapho Boolean value Aij indicates whether nodes i, j are connected

● Variants: o (Un)directed graphso Weighted graphs…

Adjacency Matrix: Example

● Pros:o Adding/removing edges o Checking if 2 nodes are

connected

● Cons:o Quadratic space: O(n2)o We usually have sparse graphso Adding nodes is expensiveo Retrieval of all the

neighbouring nodes takes linear time: O(n)

Data Structure: Adjacency List

● A set of lists, each enumerating neighbours of one nodeo A vector of n pointers to adjacency lists

● Undirected graph:o An edge connects nodes i and j o => the adjacency list of i contains node j and vice versa

● Often compressedo Exploiting regularities in graphs, difference from other

nodes, …

Adjacency List: Example

● Pros:o Getting the neighbours of a nodeo Cheap addition of nodeso More compact representation of

sparse graphs

● Cons:o Checking if there is an edge

between two nodes Optimization: sorted lists =>

logarithmic scan, but also logarithmic insertion

Basic Graph Algorithms

● Access all nodes:o Breadth-first Search (BFS)o Depth-first Search (DFS)

● Single-source shortest path problemo BFS (unweighted), o Dijkstra (nonnegative weights), o Bellman-Ford algorithm

Special consideration to cyclic pathes

http://en.wikipedia.org/wiki/Shortest_path_problem

Improving Data Locality

● Performance of the read/write operationso Depends also on physical organization of the datao Objective: Achieve the best “data locality”

● Spatial locality: o if a data item has been accessed, the nearby data items

are likely to be accessed in the following computations e.g., during graph traversal

● Strategy: o in graph adjacency matrix representation, exchange rows

and columns to improve the disk cache hit ratioo Specific methods: BFSL, Bandwidth of a Matrix, ...

Breadth First Search Layout (BFSL)

● Input: vertices of a graph● Output: a permutation of the vertices

○ with better cache performance for graph traversals

● BFSL algorithm:1. Select a node (at random, the origin of the traversal)2. Traverse the graph using the BFS alg.

generating a list of vertex identifiers in the order they are visited

3. Take the generated list as the new vertices permutation

Breadth First Search Layout (2)

● Let us recall: Breadth First Search (BFS)o FIFO queue of frontier vertices

● Pros: optimal when starting from the same node● Cons: starting from other nodes

○ The further, the worse

Compare to Depth First Search

Bandwidth of a Matrix

● Graphs may be seen as matriceso Adjacency matrix

● Locality problem -> minimum bandwidth problemo Bandwidth of a row in a matrix = the maximum distance

between nonzero elements, where one is left of the diagonal and the other is right of the diagonal

o Bandwidth of a matrix = maximum bandwidth of its rows● Low bandwidth matrices are more cache friendly

o Non zero elements (edges) clustered about the diagonal

Matrix Bandwidth Minimization

BMP Approx.: Cuthill-McKee (1969)

● Bandwidth minimization technique o for sparse matrices

● Re-labels the vertices according to a sequenceo resulting from a heuristically guided traversal

● Algorithm:1. The first node (where the traversal starts) is the node

with the smallest degree in the whole graph2. Other nodes are labeled sequentially as they are visited

by BFS traversal In addition, the heuristic prefers those nodes with

smaller degree

Graph Partitioning

● Some graphs are too large to be fully loaded into the main memory of a single computero Usage of secondary storage degrades the performance of

graph applicationso Scalable solution: distribute the graph on multiple nodes

● We need to partition the graph reasonablyo Usually for a particular (set of) operation(s)

The shortest path, finding frequent patterns, BFS, spanning tree search

One-Dimensional Partitioning

● Aim: partitioning the graph to solve BFS efficientlyo Distributed into shared-nothing parallel systemo Partitioning of the adjacency matrix

● 1D partitioning:o Matrix rows are randomly assigned to the P nodes

(processors) in the systemo Each vertex and the edges emanating from it are owned by

one processor

Two-dimensional Partitioning

● 2D partitioningo Processors are logically

arranged in an P = R C ⨉ mesh We have R processor rows,

C processor columnso Adjacency matrix is divided into

C blocks of columns and R C ⨉blocks of rows

o Each processor owns C blocks 1D partitioning

= 2D partitioning with R = 1 OR C = 1

Neo4j: Basic Info

● Open source graph databaseo The most popular

● Initial release: 2007● Written in: Java● OS: cross-platform● Stores data as nodes connected

by directed, typed relationshipso With properties on botho Called the “property graph”

Neo4j: Basic Features

● intuitive – a graph model for data representation● reliable – with full ACID transactions● durable and fast – disk-based, native storage engine ● scalable – up to several billion nodes/relationships/properties● highly-available – when distributed across multiple machines● expressive – powerful, human readable graph query language● fast – powerful traversal framework● embeddable - in Java program● simple – accessible by REST interface & Java API

Graphs (Neo4j) vs. RDBMS

● RDBMS designed for a single type of relationshipo “Who is my manager”

● Adding another relationship usually means a lot of schema changes

● In RDBMS we model the graph beforehand based on the traversal we wanto If the traversal changes, the data will have to changeo Graph DBs: the relationship is not calculated but persisted

Graphs (Neo4j) vs. RDBMS (2)

● RDBMS is optimized for aggregated data● Neo4j is optimized for highly connected data

source: http://neo4j.com/docs/stable/

Graphs (Neo4j) vs. Key-value Stores

● Key-value model is for lookups of simple values or lists● With Neo4j, one can elaborate simple data structures into more

complex, interconnected data

● Column-family stores can be considered a step in evolution of key-value stores

● The value contains a list of columns

source: http://neo4j.com/docs/stable/

Graphs (Neo4j) vs. Document Stores

● Document stores are for data that can be represented as a treeo References to other

documents are more expressive

● Graph databases are for free graph data

source: http://neo4j.com/docs/stable/

Graph DBs: Suitable Use Cases

● Connected Datao Social networkso Any link-rich domain is well suited for graph databases

● Routing, Dispatch, and Location-Based Serviceso Node = location or address that has a deliveryo Graph = nodes where a delivery has to be madeo Relationships = distance

● Recommendation Engineso “your friends also bought this product”o “when buying this item, these others are usually bought”

Graph DBs: When Not to Use

● If we want to update all or a subset of entitieso Changing a property on many nodes is not straightforward

e.g., analytics solution where all entities may need to be updated with a changed property

● Some graph databases may be unable to handle lots of datao Distribution of a graph is difficult

Questions?

Please, any questions? Good question is a gift...

References

● RNDr. Irena Holubova, Ph.D. MMF UK course NDBI040: Big Data Management and NoSQL Databases

● Sherif Sakr - Eric Pardede: Graph Data Management: Techniques and Applications

● Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Addison-Wesley Professional, 192 p.

● http://neo4j.com/docs/stable/