Graph databases
Agenda
● Graph Databases: Mission, Data, Example
● A Bit of Graph Theory
  o Graph Representations
  o Improving Data Locality (efficient storage)
  o Graph Partitioning and Traversal Algorithm
  o Types of Queries
● Graph Databases Neo4j: Basics
Graph Databases: Mission
● To store entities and the relationships between them
  o Nodes are instances of objects
  o Nodes have properties, e.g., name
  o Edges connect nodes and have directional significance
  o Edges have types, e.g., likes, friend, …
● Nodes are organized by relationships
  o This allows us to find interesting patterns
  o Example: get all nodes that are “employee” of “Big Company” and that “likes” “NoSQL Distilled”
Graph Databases: Representatives
Ranked list: http://db-engines.com/en/ranking/graph+dbms
Basic Characteristics
● Different types of relationships between nodes
  o To represent relationships between domain entities
  o Or to model any kind of secondary relationship
    Category, path, time-trees, …
● No limit to the number and kind of relationships
● Relationships have: type, start node, end node, own properties
  o e.g., “since when” did they become friends
Example: Neo4j
// graphDb (the database handle) and FRIEND (a relationship type) are assumed to be defined elsewhere
Node martin = graphDb.createNode();
martin.setProperty("name", "Martin");
Node pramod = graphDb.createNode();
pramod.setProperty("name", "Pramod");

// A relationship is directed, so a mutual friendship needs one relationship per direction
Relationship m2p = martin.createRelationshipTo(pramod, FRIEND);
Relationship p2m = pramod.createRelationshipTo(martin, FRIEND);

m2p.setProperty("quality", "a good one");
p2m.setProperty("quality", "a good one");
● Undirected edge:
  o Relationship between the nodes in both directions
  o INCOMING and OUTGOING relationships from a node
A Bit of Theory
● Data: a set of entities and their relationships
  o => we need to efficiently represent graphs
● Basic operations: finding the neighbours of a node, checking if two nodes are connected by an edge, updating the graph structure, …
  o => we need efficient graph operations
● Graph G = (V, E) is commonly modelled as
  o set of nodes (vertices) V
  o set of edges E
  o n = |V|, m = |E|
● Which data structure to use?
Data Structure: Adjacency Matrix
● Two-dimensional array A of n ⨉ n Boolean values
  o Indexes of the array = node identifiers of the graph
  o Boolean value A[i][j] indicates whether nodes i, j are connected
● Variants:
  o (Un)directed graphs
  o Weighted graphs
  o …
Adjacency Matrix: Example
● Pros:
  o Adding/removing edges
  o Checking if 2 nodes are connected
● Cons:
  o Quadratic space: O(n²)
  o We usually have sparse graphs
  o Adding nodes is expensive
  o Retrieval of all the neighbouring nodes takes linear time: O(n)
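The trade-offs above can be sketched in a few lines of Java. This is only an illustration for an undirected, unweighted graph with a fixed number of nodes; the class and method names (AdjacencyMatrixGraph, addEdge, isConnected, neighbours) are made up for this sketch and do not come from any library.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative adjacency matrix for an undirected graph (hypothetical class name). */
final class AdjacencyMatrixGraph {
    // a[i][j] == true iff nodes i and j are connected; always n * n cells, even for sparse graphs
    private final boolean[][] a;

    AdjacencyMatrixGraph(int n) {
        a = new boolean[n][n];
    }

    /** Adds an undirected edge in constant time. */
    void addEdge(int i, int j) {
        a[i][j] = true;
        a[j][i] = true;
    }

    /** Removes an undirected edge in constant time. */
    void removeEdge(int i, int j) {
        a[i][j] = false;
        a[j][i] = false;
    }

    /** Constant-time edge check. */
    boolean isConnected(int i, int j) {
        return a[i][j];
    }

    /** Scans the whole row i, so this is O(n) regardless of the node's degree. */
    List<Integer> neighbours(int i) {
        List<Integer> result = new ArrayList<>();
        for (int j = 0; j < a.length; j++) {
            if (a[i][j]) {
                result.add(j);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        AdjacencyMatrixGraph g = new AdjacencyMatrixGraph(4);
        g.addEdge(0, 1);
        g.addEdge(0, 3);
        System.out.println(g.neighbours(0));
    }
}
```

Adding a node is not shown: with a fixed array it would mean allocating a larger matrix and copying, which is why the slide calls it expensive.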
Data Structure: Adjacency List
● A set of lists, each enumerating neighbours of one node
  o A vector of n pointers to adjacency lists
● Undirected graph:
  o An edge connects nodes i and j
  o => the adjacency list of i contains node j and vice versa
● Often compressed
  o Exploiting regularities in graphs, difference from other nodes, …
Adjacency List: Example
● Pros:
  o Getting the neighbours of a node
  o Cheap addition of nodes
  o More compact representation of sparse graphs
● Cons:
  o Checking if there is an edge between two nodes
    Optimization: sorted lists => logarithmic lookup, but insertion is no longer constant (at least logarithmic, to find the position)
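A minimal Java sketch of an adjacency list with the sorted-list optimization mentioned above. The names (AdjacencyListGraph, addNode, insertSorted) are hypothetical. Note one assumption of this sketch: it stores each list in an ArrayList, so binary search finds the position in logarithmic time, but the insertion itself still shifts elements.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative adjacency list for an undirected graph (hypothetical class name). */
final class AdjacencyListGraph {
    // adj.get(i) holds the neighbours of node i, kept sorted
    private final List<List<Integer>> adj = new ArrayList<>();

    /** Cheap addition of a node: just append an empty list. Returns the new node id. */
    int addNode() {
        adj.add(new ArrayList<>());
        return adj.size() - 1;
    }

    /** Undirected edge: j goes into the list of i and vice versa. */
    void addEdge(int i, int j) {
        insertSorted(adj.get(i), j);
        insertSorted(adj.get(j), i);
    }

    /** Binary search over the sorted list: logarithmic in the degree of i. */
    boolean isConnected(int i, int j) {
        return Collections.binarySearch(adj.get(i), j) >= 0;
    }

    /** Returns the stored list directly; no scan over all n nodes is needed. */
    List<Integer> neighbours(int i) {
        return Collections.unmodifiableList(adj.get(i));
    }

    private static void insertSorted(List<Integer> list, int value) {
        int pos = Collections.binarySearch(list, value);
        if (pos < 0) {
            // binarySearch returns -(insertionPoint) - 1 when the value is absent
            list.add(-pos - 1, value);
        }
    }

    public static void main(String[] args) {
        AdjacencyListGraph g = new AdjacencyListGraph();
        int a = g.addNode();
        int b = g.addNode();
        int c = g.addNode();
        g.addEdge(a, c);
        g.addEdge(a, b);
        System.out.println(g.neighbours(a));
    }
}
```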
Basic Graph Algorithms
● Access all nodes:
  o Breadth-first Search (BFS)
  o Depth-first Search (DFS)
● Single-source shortest path problem
  o BFS (unweighted)
  o Dijkstra (nonnegative weights)
  o Bellman-Ford algorithm
    Special consideration for cyclic paths
http://en.wikipedia.org/wiki/Shortest_path_problem
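As an illustration of the unweighted case, the sketch below computes hop distances from a single source with BFS. The class name BfsShortestPath and the convention of marking unreachable nodes with -1 are assumptions of this example; the graph is passed as an adjacency list.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

/** Illustrative single-source shortest paths on an unweighted graph (hypothetical class name). */
final class BfsShortestPath {
    /**
     * Returns the number of edges on a shortest path from source to every node.
     * adj.get(v) lists the neighbours of v; unreachable nodes get -1 (assumed convention).
     */
    static int[] distances(List<List<Integer>> adj, int source) {
        int[] dist = new int[adj.size()];
        Arrays.fill(dist, -1);
        Queue<Integer> frontier = new ArrayDeque<>(); // FIFO queue of frontier vertices
        dist[source] = 0;
        frontier.add(source);
        while (!frontier.isEmpty()) {
            int v = frontier.remove();
            for (int w : adj.get(v)) {
                // The "already visited" check keeps cycles from causing endless loops
                if (dist[w] == -1) {
                    dist[w] = dist[v] + 1;
                    frontier.add(w);
                }
            }
        }
        return dist;
    }

    public static void main(String[] args) {
        List<List<Integer>> adj = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(0, 3), Arrays.asList(0, 3), Arrays.asList(1, 2));
        System.out.println(Arrays.toString(distances(adj, 0)));
    }
}
```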
Improving Data Locality
● Performance of the read/write operations
  o Depends also on physical organization of the data
  o Objective: achieve the best “data locality”
● Spatial locality:
  o If a data item has been accessed, the nearby data items are likely to be accessed in the following computations
    e.g., during graph traversal
● Strategy:
  o In the adjacency matrix representation of a graph, exchange rows and columns to improve the disk cache hit ratio
  o Specific methods: BFSL, Bandwidth of a Matrix, ...
Breadth First Search Layout (BFSL)
● Input: vertices of a graph
● Output: a permutation of the vertices
  o with better cache performance for graph traversals
● BFSL algorithm:
  1. Select a node (at random, the origin of the traversal)
  2. Traverse the graph using the BFS algorithm, generating a list of vertex identifiers in the order they are visited
  3. Take the generated list as the new permutation of the vertices
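The three steps can be sketched as follows. This is a minimal illustration, not a reference implementation: the class name BfsLayout is hypothetical, the origin is passed in by the caller instead of being drawn at random, and vertices that BFS cannot reach from the origin are appended by restarting the traversal (an assumption, since the slide does not cover disconnected graphs).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

/** Illustrative Breadth First Search Layout (hypothetical class name). */
final class BfsLayout {
    /**
     * Returns the vertex ids in the order BFS visits them, starting at origin.
     * order[k] is the old id of the vertex stored at new position k.
     */
    static int[] layout(List<List<Integer>> adj, int origin) {
        int n = adj.size();
        int[] order = new int[n];
        boolean[] seen = new boolean[n];
        int count = 0;
        Queue<Integer> frontier = new ArrayDeque<>();
        // Assumption: other components are handled by restarting BFS at the next unseen id
        for (int k = 0; k < n; k++) {
            int start = (origin + k) % n;
            if (seen[start]) {
                continue;
            }
            seen[start] = true;
            frontier.add(start);
            while (!frontier.isEmpty()) {
                int v = frontier.remove();
                order[count++] = v; // step 2: record ids in visiting order
                for (int w : adj.get(v)) {
                    if (!seen[w]) {
                        seen[w] = true;
                        frontier.add(w);
                    }
                }
            }
        }
        return order; // step 3: use this list as the new vertex permutation
    }

    public static void main(String[] args) {
        List<List<Integer>> adj = Arrays.asList(
                Arrays.asList(2, 3), Arrays.asList(2, 3), Arrays.asList(0, 1), Arrays.asList(0, 1));
        System.out.println(Arrays.toString(layout(adj, 0)));
    }
}
```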
Breadth First Search Layout (2)
● Let us recall: Breadth First Search (BFS)
  o FIFO queue of frontier vertices
● Pros: optimal when starting from the same node
● Cons: starting from other nodes
  o The further, the worse
Compare to Depth First Search
Dijkstra Algorithm
● https://www.youtube.com/watch?v=WN3Rb9wVYDY
Bandwidth of a Matrix
● Graphs may be seen as matrices
  o Adjacency matrix
● Locality problem -> minimum bandwidth problem
  o Bandwidth of a row in a matrix = the maximum distance between nonzero elements, where one is left of the diagonal and the other is right of the diagonal
  o Bandwidth of a matrix = maximum bandwidth of its rows
● Low bandwidth matrices are more cache friendly
  o Nonzero elements (edges) clustered about the diagonal
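A small Java sketch of the bandwidth computation, following the row-based definition above. One detail is an assumption of this sketch: when a row has nonzero elements on only one side of the diagonal, the diagonal position itself is used as the other endpoint. The names MatrixBandwidth and fromEdges are hypothetical.

```java
/** Illustrative bandwidth computation on a Boolean adjacency matrix (hypothetical class name). */
final class MatrixBandwidth {
    /** Distance between the leftmost and rightmost nonzero of row i, measured across the diagonal. */
    static int rowBandwidth(boolean[][] a, int i) {
        int left = i;  // assumption: the diagonal is the endpoint if nothing lies left of it
        int right = i; // assumption: the diagonal is the endpoint if nothing lies right of it
        for (int j = 0; j < a[i].length; j++) {
            if (a[i][j]) {
                left = Math.min(left, j);
                right = Math.max(right, j);
            }
        }
        return right - left;
    }

    /** Bandwidth of the matrix = maximum bandwidth of its rows. */
    static int bandwidth(boolean[][] a) {
        int max = 0;
        for (int i = 0; i < a.length; i++) {
            max = Math.max(max, rowBandwidth(a, i));
        }
        return max;
    }

    /** Helper for the example: builds a symmetric matrix from undirected edges {u, v}. */
    static boolean[][] fromEdges(int n, int[][] edges) {
        boolean[][] a = new boolean[n][n];
        for (int[] e : edges) {
            a[e[0]][e[1]] = true;
            a[e[1]][e[0]] = true;
        }
        return a;
    }

    public static void main(String[] args) {
        // The same path graph under two different vertex labelings
        boolean[][] good = fromEdges(4, new int[][] {{0, 1}, {1, 2}, {2, 3}});
        boolean[][] bad = fromEdges(4, new int[][] {{0, 3}, {3, 1}, {1, 2}});
        System.out.println(bandwidth(good) + " vs " + bandwidth(bad));
    }
}
```

The two matrices in main describe the same path graph; only the vertex labels differ, which is exactly what a re-labeling method tries to exploit.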
BMP Approx.: Cuthill-McKee (1969)
● Bandwidth minimization technique
  o for sparse matrices
● Re-labels the vertices according to a sequence
  o resulting from a heuristically guided traversal
● Algorithm:
  1. The first node (where the traversal starts) is the node with the smallest degree in the whole graph
  2. Other nodes are labeled sequentially as they are visited by BFS traversal
     In addition, the heuristic prefers those nodes with smaller degree
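The two steps can be sketched in Java as below. This is a simplified illustration under stated assumptions: ties between equal degrees are broken by the smaller vertex id, and a disconnected graph is handled by restarting from the unvisited node with the smallest degree. Neither detail is specified on the slide, and the class name CuthillMcKee is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Queue;

/** Illustrative Cuthill-McKee ordering on an adjacency list (hypothetical class name). */
final class CuthillMcKee {
    /** Returns the vertices in labeling order: position k holds the old id that gets new label k. */
    static List<Integer> ordering(List<List<Integer>> adj) {
        int n = adj.size();
        boolean[] seen = new boolean[n];
        List<Integer> order = new ArrayList<>(n);
        // Smaller degree first; ties broken by vertex id (assumption of this sketch)
        Comparator<Integer> byDegree = Comparator
                .<Integer>comparingInt(v -> adj.get(v).size())
                .thenComparingInt(v -> v);
        while (order.size() < n) {
            // Step 1: start at the unvisited node with the smallest degree
            int start = -1;
            for (int v = 0; v < n; v++) {
                if (!seen[v] && (start == -1 || byDegree.compare(v, start) < 0)) {
                    start = v;
                }
            }
            seen[start] = true;
            Queue<Integer> frontier = new ArrayDeque<>();
            frontier.add(start);
            // Step 2: BFS, enqueueing the unvisited neighbours in order of increasing degree
            while (!frontier.isEmpty()) {
                int v = frontier.remove();
                order.add(v);
                List<Integer> next = new ArrayList<>();
                for (int w : adj.get(v)) {
                    if (!seen[w]) {
                        seen[w] = true;
                        next.add(w);
                    }
                }
                next.sort(byDegree);
                frontier.addAll(next);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        List<List<Integer>> adj = Arrays.asList(
                Arrays.asList(3, 2, 1), Arrays.asList(0), Arrays.asList(0),
                Arrays.asList(0, 4), Arrays.asList(3));
        System.out.println(ordering(adj));
    }
}
```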
Graph Partitioning
● Some graphs are too large to be fully loaded into the main memory of a single computer
  o Usage of secondary storage degrades the performance of graph applications
  o Scalable solution: distribute the graph on multiple nodes
● We need to partition the graph reasonably
  o Usually for a particular (set of) operation(s)
    The shortest path, finding frequent patterns, BFS, spanning tree search
One-Dimensional Partitioning
● Aim: partitioning the graph to solve BFS efficiently
  o Distributed over a shared-nothing parallel system
  o Partitioning of the adjacency matrix
● 1D partitioning:
  o Matrix rows are randomly assigned to the P nodes (processors) in the system
  o Each vertex and the edges emanating from it are owned by one processor
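A minimal sketch of the 1D row assignment in Java. It only shows the ownership bookkeeping, not the distributed BFS itself; the names OneDimPartitioning, assignRows and ownedRows are hypothetical, and the seed parameter is an addition of this sketch to make the random assignment repeatable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Illustrative 1D partitioning of adjacency matrix rows (hypothetical class name). */
final class OneDimPartitioning {
    /**
     * Randomly assigns each of the n matrix rows to one of p processors.
     * owner[v] is the processor that owns vertex v and all edges emanating from it.
     */
    static int[] assignRows(int n, int p, long seed) {
        Random rnd = new Random(seed); // the seed is an assumption, used only for repeatability
        int[] owner = new int[n];
        for (int v = 0; v < n; v++) {
            owner[v] = rnd.nextInt(p); // uniform over 0 .. p-1
        }
        return owner;
    }

    /** Lists the rows (vertices) owned by one processor. */
    static List<Integer> ownedRows(int[] owner, int processor) {
        List<Integer> rows = new ArrayList<>();
        for (int v = 0; v < owner.length; v++) {
            if (owner[v] == processor) {
                rows.add(v);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        int[] owner = assignRows(10, 3, 42L);
        System.out.println(Arrays.toString(owner));
        System.out.println(ownedRows(owner, 0));
    }
}
```

During a BFS step, a processor that discovers a neighbour w would forward it to owner[w], since only that processor holds the row of w.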
Two-dimensional Partitioning
● 2D partitioning
  o Processors are logically arranged in a P = R ⨉ C mesh
    We have R processor rows, C processor columns
  o Adjacency matrix is divided into C blocks of columns and R ⨉ C blocks of rows
  o Each processor owns C blocks
● 1D partitioning = 2D partitioning with R = 1 or C = 1
Neo4j: Basic Info
● Open source graph database
  o The most popular
● Initial release: 2007
● Written in: Java
● OS: cross-platform
● Stores data as nodes connected by directed, typed relationships
  o With properties on both
  o Called the “property graph”
Neo4j: Basic Features
● intuitive – a graph model for data representation
● reliable – with full ACID transactions
● durable and fast – disk-based, native storage engine
● scalable – up to several billion nodes/relationships/properties
● highly-available – when distributed across multiple machines
● expressive – powerful, human readable graph query language
● fast – powerful traversal framework
● embeddable – in a Java program
● simple – accessible by REST interface & Java API
Graphs (Neo4j) vs. RDBMS
● RDBMS is designed for a single type of relationship
  o “Who is my manager”
● Adding another relationship usually means a lot of schema changes
● In RDBMS we model the graph beforehand based on the traversal we want
  o If the traversal changes, the data will have to change
  o Graph DBs: the relationship is not calculated but persisted
Graphs (Neo4j) vs. RDBMS (2)
● RDBMS is optimized for aggregated data
● Neo4j is optimized for highly connected data
source: http://neo4j.com/docs/stable/
Graphs (Neo4j) vs. Key-value Stores
● Key-value model is for lookups of simple values or lists
● With Neo4j, one can elaborate simple data structures into more complex, interconnected data
● Column-family stores can be considered a step in evolution of key-value stores
● The value contains a list of columns
source: http://neo4j.com/docs/stable/
Graphs (Neo4j) vs. Document Stores
● Document stores are for data that can be represented as a tree
  o References to other documents are more expressive
● Graph databases are for free graph data
source: http://neo4j.com/docs/stable/
Graph DBs: Suitable Use Cases
● Connected Data
  o Social networks
  o Any link-rich domain is well suited for graph databases
● Routing, Dispatch, and Location-Based Services
  o Node = location or address that has a delivery
  o Graph = nodes where a delivery has to be made
  o Relationships = distance
● Recommendation Engines
  o “your friends also bought this product”
  o “when buying this item, these others are usually bought”
Graph DBs: When Not to Use
● If we want to update all or a subset of entities
  o Changing a property on many nodes is not straightforward
    e.g., analytics solution where all entities may need to be updated with a changed property
● Some graph databases may be unable to handle lots of data
  o Distribution of a graph is difficult
References
● RNDr. Irena Holubova, Ph.D.: MFF UK course NDBI040: Big Data Management and NoSQL Databases
● Sherif Sakr - Eric Pardede: Graph Data Management: Techniques and Applications
● Sadalage, P. J., & Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Addison-Wesley Professional, 192 p.
● http://neo4j.com/docs/stable/