Migrate\Import existing database into Neo4j

Posted on Posted in Data & Business Intelligence

We needed to migrate our existing user base data inside Neo4j. Before moving to Neo4j, we used several databases: Oracle LDAP, MySQL (EjabberD), Riak, Oracle RAC, Elasticsearch and more.

At the end of the day, we needed to migrate 2.65 billion nodes and 6.2 billion relationships.

This is our case: we started with neo4j-import tool, moved to neo4j-admin import tool and then back to neo4j-import tool due to bugs. At the end of the day, we loaded 2.65B nodes and 6.2B relationships in less than a day.

Neo4j-import tool

This wasn’t our first time with Neo4j. We used Neo4j in the past (2013) for PYMK (People You May Know) functionality. Back then, Neo4j version was 2.0.0 and we used neo4j-import tool to populate the database from csv files.
This time, Neo4j version was 3.1.1 and we decided to use neo4j-import again.

When we tried to execute the neo4j-import tool, we got a warning message that it’s deprecated and we should use the neo4j-admin import option:

Neo4j-admin import tool

When using the neo4j-admin import tool (and the neo4j-import tool), it’s important to note that you can only populate an empty database using neo4j-admin import. Once the neo4j-admin import is done, you won’t be able to execute again (without deleting the data):

CSV format

When using the neo4j-import / neo4j-admin import tool, you need to supply csv files. We loaded several types of nodes and relationships, but for the example we will use:

    • user_nodes.csv – a csv with 2.6 billion  rows. Each row will become a node in Neo4j.  The header, will be the name of the property. Each node must have a unique identifier in order to be able to create relationships between nodes using the import tool. In order to identify a column in the csv as the unique identifier, just add :ID to the column’s name. You can add a column :TYPE and the value would be the label of the node. You can also specify the label on the neo4j-admin import command instead. You can create multiple csv files for nodes, not just one.
    • Example:

    • users_rel.csv – a csv with 6.2 billion relationships. Each row will be a relationship in Neo4j. In this csv the header represents the nodes between we want to create a relationship and the type of the relationship. We can add more columns that would be the relationship properties. You can use multiple csv files, not just one.
    • Example:

Important:

  • If you need to load several types (labels) of nodes and relationships – you need to make sure to include the label in the header of the files.

  Example: Let’s say we have 2 types of nodes: Users and Groups and  2 types of relationships: (User)-[:FRIEND]-(User) and (User)-[:MEMBER]-(Group).

  In that case the header of the User file will look like:

 The header of the Group file will look like:

And the headers of the relationships files:

  • If you need to load several types (labels) of nodes and/or relationships – you nee to mention in the  neo4j-import command  the files with the same label in a consecutive order.

 Example:

  • Try to execute the batch import on a portion of the data (10k, 100k) – just to verify that the headers are correct. The import tool only validates the header once it’s loading the file, so you don’t want to find out after 10 hours that the process had failed because of a missing/wrong header in one of the files.
Neo4j-admin import options

As described on the Neo4j documentation for the import tool, there are several options you can configure when using neo4j-admin import (and neo4j-import):

  • –id-type string – this means that the ids that will use for identifying nodes will use as string. This does not create any index/constraint.
  • –stacktrace – enable printing of error stack traces.
  • –bad-tolerance – number of bad entries before the import is failing. the default is 1000. We increased it to 100K.
  • –skip-bad-relationships – whether or not to skip importing relationships between missing modes. the default is true.
  • –skip-duplicate-nodes – whether or not to skip importing nodes that have the same id (:ID), default is false, we changed it to true.
  • –database – the name of database directory. The default is graph.db.  If you’re changing it to be something else, keep in mind that you also need to change it on neo4j.conf file
Neo4j-admin import execution

This is the command we executed:

The spec of the server was Ubuntu 14.04.2 server with 20 CPU,140GB RAM and 1TB of SSD.

At first everything seems to be ok:

And then we waited, and waited..we weren’t sure how much we should wait. After 10 minutes, we noticed that it seems that the process is stuck: The CPU is idle, no IOPS, no memory is being used and no swap…

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

even though we added the –stacktrace true we didn’t see anything in the logs.

After reading online we decided to run the command using –multiline-fields true, because we suspected that perhaps we have special characters in the csv files which causing a problem.

Finally, we got an error message, something to work with…

This is a sample from /usr/share/neo4j/bin/import.report which shows us the nodes that don’t exist

Since we defined in the neo4j-admin import command –bad-tolerance 100000, we thought that it’s strange that we have that many missing nodes (more than 100,000).

When counting the number of lines in the logs (since every error is across multiple lines, it doesn’t reflect the number of errors, but just as a reference):

But we set a very high –bad-tolerance and it looks like it ignores it completely and fails every time.

In the stack trace it shows that it fails after 1000 relationships (which is the default –-bad-tolerance) so it doesn’t respect the flag passed to the command.
Here is the trace output:

After we approached Neo4j support, we verified that this is a bug in neo4j-admin import and we were advised to use Neo4j-import tool

Neo4j-import

Back to square one. We started with neo4j-import tool and looks like we will finish the job using it.

When we first executed the neo4j-import, we got an error message:

The issue we saw was “Caused by:Extra column not present in header on line” and we were advised to add –ignore-extra-columns, which determines whether or not to ignore extra columns in the data not specified by the header.

So now this is the full command we used:

And the output:

Finally, we loaded all nodes and all relationships. Some relationships weren’t loaded due to missing nodes, but the number was low (1200 out of 2.6B).

The following graphs will show the server’s CPU, memory and swap  during the load:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Creating Indices

Since we used the import tool, we now need to add indices and/or constrains to out database. We added 7 unique constrains to the database (using Cypher) and the execution time was almost 12 hours.

We executed the indices using neo4j-shell. Please note that you need to enable a remote shell server in the config file (don’t forget to disable once you’re done):

 

We put all the create statements in a file (.cql) and execute the neo4j-shell with nohup

As a reference, this was our database size after we loaded everything: Please note that the size is for our case, and can change from one use case to another, depend on the number of indices, number of nodes, number of properties (and type and their size), number of relationships and more.

And inside schema:

 

CPU and IOPS during indices creation:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Leave a Reply

Your email address will not be published. Required fields are marked *