Before we start, this is the definition of k-safety from Vertica documentations:
K-safety is a measure of fault tolerance in the database cluster. The value K represents the number of replicas of the data in the database that exist in the database cluster. These replicas allow other nodes to take over for failed nodes, allowing the database to continue running while still ensuring data integrity. If more than K nodes in the database fail, some of the data in the database may become unavailable. In that case, the database is considered unsafe and automatically shuts down.
It is possible for an HP Vertica database to have more than K nodes fail and still continue running safely, because the database continues to run as long as every data segment is available on at least one functioning cluster node. Potentially, up to half the nodes in a database with a K-safety level of 1 could fail without causing the database to shut down. As long as the data on each failed node is available from another active node, the database continues to run.
From our own experience of k-safety:
Most of other HPE Vertica customers are using k-safe=1. K-safety=1 is the recommended K-safety value by HPE Vertica, and when we opened a ticket to support asking about it – they wrote us and emphasized that - if you’re using K-safety = 2, you should lower it to K-safety = 1.
Let's look at the Pros & Cons of increasing the k-safety:
- Increase fault tolerance, since we have more replicas of the data, HPE Vertica can run normally if any two nodes fail (potentially even more – as long as at least one other node in the cluster has a copy of the dialed nodes’s data).
- Increased time it takes to load data – since we need to update all replicas – and now you have 3 instead of 2
- Increased storage – you will save more replicas of the data (3 instead of 2)
- Increased recovery time – the time it takes to recover a node is increased
- When adding or removing a node from the cluster – the rebalance process will take longer
- Backup time and storage
From the reasons mentioned above - it's best practices to have k-safety set to 1.
Back to our story – in our case, we needed to migrate our Vertica cluster (QA!) to another data center and it was k-safety=0 (since we wanted to save storage), so we needed to change it to k-safety=1 so we could do the migration online without shutting down to entire cluster.
So this is how we’ve done it:
- At first, you’ll need to increase the k-safety:
- The output of this command is important – it’s actually the list of all projection which are missing the buddy projections.
Current design does not meet the requirements for K = 1.
Current design is valid for K = 0
Projection XXXX has insufficient buddy projections; it has 0 buddies.
- creating buddy projections
- You can view the projection of a table using get_projection function (it will also let you know if these projections has buddy projections or not)
- create projection command
- Since we had ~200 projection with missing buddy projection – we used this query – which generated create projection statements:
select ' create projection ' || projection_schema||'.'|| anchor_table_name || '_b1 as select * from '|| projection_schema||'.'|| anchor_table_name||
' segmented by ' || segment_expression||' all nodes offset 1;'
from projections where projection_name like '%_super'
and projection_schema||'.'|| anchor_table_name || '_b1' not in (select distinct projection_schema||'.'||projection_name from projections where projection_name not like '%_super')
- The _b1 at the end of a projection is the standard for adding budding projection
- If a projection already has a buddy projection – the command will fail (just this one, so it’s ok).
- We assumed that there’s a segmented by clause for each projection – but in case it’s null – the create statement will fail. This can be solved easily by adding a decode/case statement that checks if there’s a segment_expression value inside projections before adding segmented by clause to the create projection statement
- The next step is to refresh the new projections with data:
- Now you should execute SELECT MARK_DESIGN_KSAFE(1); again and the output should be:
dbadmin=> SELECT MARK_DESIGN_KSAFE(1);
Marked design 1-safe
- That’s it.