This post is about performance issues we experienced with Neo4j.
First, let’s discuss shortly about the benefits of a cluster:
why to use cluster:
The benefits from using cluster are:
- High Availability: In case of a failure in a server (or servers, depends on the cluster topology), the Neo4j service is up and running.
- Disaster recovery: The ability to recover from major service outages (greater than can be accommodated by the redundant capacity in a continuously available cluster), such as data center outages, physical network outages, etc. In Neo4j you can add another server on a different data center.
- Analytics & Reporting: The ability to execute analytical queries on your graph or any ad-hoc request without without over loading your own production cluster. You can add a dedicate read replica and use it internally for analytics, backups and more.
- High throughput: Each node in Neo4j contains all the data. So the more nodes you have the more requests the cluster can handle. Also, you can shard you queries in a way that each server will have different part of the graph in memory.
When we migrated our data inside Neo4j, we actually migrated it inside a stand-alone server and then used the dump command and the restore command in order to copy the data to the causal cluster.
It’s important to note that the spec of the servers was the same for the stand-alone and for the causal\HA cluster: 20 CPU, 140GB RAM and SSD disks.
All the tests were done after the cache was warm (not after a service restart).
We set up 3 environment with the same data: stand-alone and causal cluster (3 nodes) and HA cluster (3 nodes) with Neo4j 3.1
We loaded data using load csv command – we loaded files with 7000-8000 rows in it and the load csv command creates 1000-1200 nodes and 28000-35000 relationships.
Just a reminder – we have a database with 2.8B nodes and 6.7B relationships and database size is over 750gb (180gb out of them are indices)
Case #1: Stand-Alone vs. Causal Cluster
This was a complete shock. The stand-alone server was actually x6 times faster than the causal cluster on writes (7 minutes vs. 42 minutes).
We tried to tune the causal cluster parameters:
It is generally preferred to change one at a time, to make sure you know which change made the difference:
1) Decrease the queue size (the number of messages waiting to be sent to other servers in the cluster) – it means we’ll commit more frequently and perhaps spend less time queuing up. The default is 64.
2) Increase the id allocations (the size of the ID allocation requests core servers will make when they run out of NODE IDs.) – the purpose was to have enough IDs and don’t need to request more in the middle of the batch (default is 1024)
Unfortunately, we didn’t see any significant improvement. This was actually when we decided to switch from causal to HA cluster.
Case #2: IOPS vs. Memory
We executed the test on a stand-alone server with 8 CPU, 56gb RAM and SSD disks.
We loaded data using load csv command (see Intro)
The execution time was 36 minutes, per file.
When looking on OS graphs, we can see that the CPU utilization was iowait 42% (on average), which seems strange.
But when looking on IOPS – we can see that there were reads – but average of ~750 IOPS is nothing for SSD.
We did tests the disks and we were able to get ~35000 IOPS, so it’s not the disks…
A look on Neo4j Page Cache graph: looks like each time we were hitting the cache we got page fault and we had a lot of evictions.
We thought that it’s strange so we decided to scale the machine to 20 CPU, 140gb RAM. Same disks.
The load csv command execution time was 7 minutes (reminder: before it was 36 minutes)
when looking at the graphs we can see that we got average of 15000 IOPS and this time the page cache was x40 higher (which means, Neo is working faster)
Interesting to see that CPU iowait was 13.43% on a 20 CPU machine which equals to ~33.5% on a 8 CPU machine
So looks like there is a correlation between the total RAM size and disks IOPS.
Case #3: Causal Cluster vs. HA Cluster
It’s important to note that Causal cluster is the new cluster architecture and for now Neo4j supports the HA cluster, but it’s not on their roadmap and possibly will be removed in future versions.
Since we experienced such bad performance when using the Causal cluster, we decided to test the HA cluster (a stand-alone is not an option for production for us).
The execution time of load csv command is 18 minutes.
We were advised to change push&pull parameters:
ha.tx_push_factor = 0. The push factor determines the number of slaves to actively push to during a commit. For one-time loading that’s ok, but we won’t recommend to set this parameter to 0 on prod since you might have branching data and this is not something we want (see more on Neo4j worst practices)
ha.pull_interval = 1
This will determine the frequency of pulling updates from the master.
After changing the above parameters, the execution time is 11 minutes, which is not as good as a stand-alone server, but better than the Causal cluster.