Understanding Graph Databases: Traversals with Amazon Neptune and Gremlin

Understanding Graph Databases: Traversals with Amazon Neptune and Gremlin

Takahiro Iwasa
Takahiro Iwasa
9 min read
Graph Database Neptune

Introduction

I recently hosted a study meeting titled “Graph Database Introduction - Relationships Traversal with Amazon Neptune and Gremlin” (event link). In this blog post, I aim to share the content of that presentation to help you understand graph databases and explore the capabilities of Amazon Neptune.

You can find example code used in this post in my GitHub repository.

Prerequisites

Target Readers

This post is intended for readers who:

  • Are interested in graph databases and Amazon Neptune.
  • Have a basic understanding of relational databases and AWS.

Goals

  • Provide an overview of graph databases and Amazon Neptune.
  • Demonstrate graph traversal using Apache TinkerPop Gremlin.

What is a Graph Database

Key Concepts

Graph databases are optimized for storing and querying relationships between entities. Unlike relational or NoSQL databases, graph databases excel in scenarios where relationships are complex and performance is critical.

In graph databases:

  • Vertices (nodes) represent entities.
  • Edges define relationships between vertices.
  • Both vertices and edges can have properties (key-value pairs).
    • A graph with labeled vertices and edges is called a property graph.

Graph Representation

Applications of Graph Databases

Social Networks

Graph databases model relationships between people, as seen in platforms like Facebook and LinkedIn.

Knowledge Graphs

These graphs add contextual meaning to data, distinguishing between different concepts, such as “Apple” the fruit versus “Apple Inc.” the company. The term “Knowledge Graph” became widely adopted following the publication of Introducing the Knowledge Graph: things, not strings.

Identity Graphs

Identity graphs link multiple user identifiers across platforms, enabling comprehensive user behavior analysis. A well-known application of this is Customer 360, which provides a unified view of the customer across all touchpoints.

Fraud Detection

Fraud rings often mix legitimate transactions with illegal activities, making detection challenging. Graph databases excel in uncovering these patterns by visualizing relationships and identifying shared attributes like credit card numbers, phone numbers, or social security numbers, enabling effective fraud detection.

Real-World Example: Panama Papers

The Panama Papers investigation used a graph database to analyze leaked documents efficiently. Learn more in this Forbes article.

Comparing Databases: Graph DB, RDB, and NoSQL

FeatureGraph DBRDBNoSQL
Data StructureGraphTableVarious
SchemaLooseStrictLoose
PurposeRelationshipsEntity storageVarious
QueryingTraversalJoins/SubqueriesVarious
PerformanceVery HighAverage 1Very High
Query LanguagesGremlin
SPARQL
openCypher 2
Cypher
GQL
SQL-
  1. The performance of relational databases (RDBs) can be enhanced by using columnar databases, such as Amazon Redshift.
  2. Amazon Neptune now supports openCypher for production environments, starting with engine release 1.1.1.0. However, it’s important to note that Cypher and GQL are currently not supported.

Introduction to Amazon Neptune

Amazon Neptune is a fully managed graph database service capable of querying billions of relationships in milliseconds. It offers high availability, durability, and low latency.

Key Features

  • Cluster Scalability: Up to 15 read replicas.
  • Automatic Scaling: Storage expands automatically up to 64 TiB.
  • Data Redundancy: Maintains 6 copies across 3 availability zones.

Transactions in Neptune

Neptune manages read-only and mutation queries using distinct transaction isolation levels. Learn more in the documentation.

Read-Only Queries

For read-only queries, Neptune uses snapshot isolation within MultiVersion Concurrency Control (MVCC). This ensures that dirty reads, non-repeatable reads, and phantom reads are prevented by using a snapshot taken at the start of the transaction.

Mutation Queries

For mutation queries, Neptune uses READ COMMITTED isolation to prevent dirty reads. Additionally, Neptune locks the record set being read, ensuring that non-repeatable reads and phantom reads are avoided.

Querying with Gremlin

Amazon Neptune supports querying with Gremlin, an open-source graph traversal language. Developers can also use Graph Notebook for querying and visualization.

Traversal Examples

Sample Graph Data

This example uses a graph where vertices have an age property and edges define a weight property.

Use the following command to load the sample data provided above. The %%gremlin is a Jupyter Notebook magic command specifically designed for use with Neptune Workbench.

%%gremlin
// Drop existing data
g.V().drop()
%%gremlin
// Add Vertices
g.addV('person').property(id, 'A').property('age', 30)
 .addV('person').property(id, 'B').property('age', 25)
 .addV('person').property(id, 'C').property('age', 35)
 .addV('person').property(id, 'D').property('age', 20)
 .addV('person').property(id, 'E').property('age', 18)
 .addV('person').property(id, 'F').property('age', 25)
 .addV('person').property(id, 'G').property('age', 10)
 .addV('person').property(id, 'H').property('age', 15)
%%gremlin
// Add Edges
g.V('A').addE('know').to(g.V('B')).property('weight', 1.0)
 .V('A').addE('know').to(g.V('C')).property('weight', 0.9)
 .V('A').addE('know').to(g.V('H')).property('weight', 0.5)
 .V('B').addE('know').to(g.V('D')).property('weight', 0.8)
 .V('B').addE('know').to(g.V('E')).property('weight', 0.4)
 .V('C').addE('know').to(g.V('F')).property('weight', 0.5)
 .V('C').addE('know').to(g.V('G')).property('weight', 0.6)
 .V('D').addE('know').to(g.V('E')).property('weight', 0.7)
 .V('H').addE('know').to(g.V('E')).property('weight', 1.0)
 .V('H').addE('know').to(g.V('G')).property('weight', 1.0)

Example 1: Retrieving All Vertices

%%gremlin
// Extract Vertices
g.V()
 .project('id', 'properties') // Projection
 .by(id()).by(valueMap()) // valueMap returns properties of vertices.

Result:

RowData
1{'id': 'A', 'properties': {'age': [30]}}
2{'id': 'B', 'properties': {'age': [25]}}
3{'id': 'C', 'properties': {'age': [35]}}
4{'id': 'D', 'properties': {'age': [20]}}
5{'id': 'E', 'properties': {'age': [18]}}
6{'id': 'F', 'properties': {'age': [25]}}
7{'id': 'G', 'properties': {'age': [10]}}
8{'id': 'H', 'properties': {'age': [15]}}

Example 2: Traversing Connections

Retrieve all persons older than 25 and connected to ‘A’ within two levels:

%%gremlin
// Extract persons (entities) which are older than 25 years old and connected from A up to 2nd.
g.V('A')
 .repeat(outE().inV()).times(2).emit() // Repeat traversal of adjacent vertices twice
 .has('age', gte(25)) // Greater than or equal 25 years old
 .project('id', 'age')
 .by(id()).by(values('age'))

Result:

RowData
1{'id': 'B', 'age': 25}
2{'id': 'C', 'age': 35}
3{'id': 'F', 'age': 25}

Example 3: Filtering by Weight

Find persons where the multiplied weight exceeds 0.5:

%%gremlin
// Start traversal at A which extracts vertices that have a multiplied weight value greater than 0.5
g.withSack(1.0f).V('A') // Sack can be used to store temporary data
 // Multiply a weight value of an out-directed edge by a sack value, and traverse all in-directed vertices
 .repeat(outE().sack(mult).by('weight').inV().simplePath()).emit()
 .where(sack().is(gt(0.5))) // A sack value greater than 0.5
 .dedup() // deduplication
 .project('id', 'weight')
 .by(id).by(sack())

Result:

RowData
1{'id': 'B', 'weight': 1.0}
2{'id': 'C', 'weight': 0.9}
3{'id': 'D', 'weight': 0.8}
4{'id': 'G', 'weight': 0.54}
5{'id': 'E', 'weight': 0.5599…}

Visualizing Graphs

Neptune Workbench offers tools to visualize graphs interactively. Refer to the official documentation for more details: Neptune Visualization.

To visualize the graph, execute the following command:

%%gremlin -d T.id -de weight
// -d specifies the vertex property to display
// -de specifies the edge property to display

// Execute traversal from example 3
g.withSack(1.0f).V('A') // Sack is used to store temporary data
 .repeat(outE().sack(mult).by('weight').inV().simplePath()).emit() // Traverse with edge weight
 .where(sack().is(gt(0.5))) // Filter paths where the sack value > 0.5
 .dedup() // Remove duplicate paths
 .path() // Extract path data
 .by(elementMap()) // Display properties of vertices and edges

Example Output:

RowData
1path[{<T.id: 1>: 'A', <T.label: 4>: 'person', 'age': 30}, {<T.id: 1>: '8ebe47fa-901b-c6d3-a11f-0a9bf0ce8aa2', <T.label: 4>: 'know', <Direction.IN: 2>: {<T.id: 1>: 'B', <T.label: 4>: 'person'}, <Direction.OUT: 3>: {<T.id: 1>: 'A', <T.label: 4>: 'person'}, 'weight': 1.0}, {<T.id: 1>: 'B', <T.label: 4>: 'person', 'age': 25}]
2path[{<T.id: 1>: 'A', <T.label: 4>: 'person', 'age': 30}, {<T.id: 1>: '7abe47fa-901c-c394-4bed-6dce7defa3f9', <T.label: 4>: 'know', <Direction.IN: 2>: {<T.id: 1>: 'C', <T.label: 4>: 'person'}, <Direction.OUT: 3>: {<T.id: 1>: 'A', <T.label: 4>: 'person'}, 'weight': 0.9}, {<T.id: 1>: 'C', <T.label: 4>: 'person', 'age': 35}]

Once executed, navigate to the Graph tab in Neptune Workbench to visualize the result. The interface supports drag, zoom-in, and zoom-out operations, allowing for intuitive graph exploration.

Conclusion

Graph databases provide significant advantages in handling relationships, and Amazon Neptune makes deploying such databases seamless on AWS. With Gremlin, you can efficiently traverse and analyze complex data relationships.

If you’re looking to explore graph databases, Amazon Neptune is an excellent starting point.

Happy Coding! 🚀


Appendix 1: Representing Tree Structures in Relational Databases (RDB)

Tree structures can be modeled in relational databases (RDB) using various approaches. In some scenarios, graph databases may not be necessary. However, it is important to note that the Naive Tree model is often suboptimal for many cases, as discussed in “SQL Antipatterns”.

ModelDescriptionProsCons
Naive Treet1.id = t2.parent_id
  • Simple implementation (Adjacency List)
  • Difficult to extract non-adjacent nodes
  • Complex SQL
  • Low performance
Path Enumerationpath LIKE '1/2%'
  • Simplifies extracting non-adjacent nodes
  • Complex INSERT/UPDATE/DELETE operations
  • Limited by column max length
Nested SetsLeft > 1 AND Right < 6
  • Efficient for querying non-adjacent nodes
  • Complex INSERT/UPDATE operations
  • Limited by column max length
  • Non-intuitive structure
Closure TableSeparate table for the tree
  • Efficient for querying all nodes
  • Handles INSERT/UPDATE/DELETE easily
  • Data can grow significantly in size
  • INSERT/UPDATE/DELETE can have low performance

Each approach has its advantages and limitations, making the choice dependent on the specific requirements of your application.

Appendix 2: Transactions

Dirty Read

When Tx1 updates the row, Tx2 reads the row, and Tx1 fails or rollbacks, Tx2 reads data which has not been reflected.

Non-repeatable Read

When Tx1 reads the row, Tx2 updates or deletes the row and commits, and Tx1 reads the row again, Tx1 reads committed data different from the previous one.

Phantom Read

When Tx1 reads the records, Tx2 inserts or deletes some rows and commits, and Tx1 reads the rows again, Tx1 reads the rows different from the previous ones.

Isolation Levels

LevelDirty ReadNon-repeatable ReadPhantom Read
READ UNCOMMITTEDPossiblePossiblePossible
READ COMMITTEDNot possiblePossiblePossible
REPEATABLE READNot possibleNot possiblePossible
SERIALIZABLENot possibleNot possibleNot possible

Appendix 3: Querying with Relational Databases (RDB)

When storing data in an RDB, querying relationships can become increasingly complex. SQL queries often grow in complexity, and representing interconnected data within a row-column structure is unintuitive.

Takahiro Iwasa

Takahiro Iwasa

Software Developer at KAKEHASHI Inc.
Involved in the requirements definition, design, and development of cloud-native applications using AWS. Now, building a new prescription data collection platform at KAKEHASHI Inc. Japan AWS Top Engineers 2020-2023.