My two-Raspberry-Pi high availability contraption has been running the replication routine for around four days now. So far so good. However, after some more reading, there are things that need tweaking or changing to make it more reliable in the long run.
NFS Server: To keep the setup bare bones, this has to be replaced with the SSH server that is already installed on both boards. I can use its ability to execute remote commands, which will hopefully remove the issues NFS can have, especially with SQLite as the backend. As a byproduct, the replication will happen over a secure link.
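A rough sketch of what replication over SSH could look like, not the routine that is currently running. It assumes key-based SSH login between the boards, and the host name and paths in the usage note are made up. The idea: take a consistent snapshot with SQLite's online backup API, copy it across with `scp`, then rename it into place with a remote `mv`.

```python
import sqlite3
import subprocess


def snapshot(db_path, snapshot_path):
    """Copy a live SQLite database into a consistent snapshot file."""
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(snapshot_path)
    try:
        src.backup(dst)  # online backup API, available since Python 3.7
    finally:
        dst.close()
        src.close()


def build_push_commands(snapshot_path, host, remote_db):
    """Return the scp and ssh commands that ship the snapshot and swap it in.

    Assumption: remote paths contain no spaces, because ssh hands the
    remote command to a shell on the other side.
    """
    remote_tmp = remote_db + ".incoming"
    return [
        ["scp", "-q", snapshot_path, f"{host}:{remote_tmp}"],
        # A rename on the same filesystem is atomic, so readers on the
        # Slave never see a half-copied file.
        ["ssh", host, "mv", remote_tmp, remote_db],
    ]


def replicate(db_path, host, remote_db):
    """Snapshot the local database and push it to the other node."""
    snap = db_path + ".snapshot"
    snapshot(db_path, snap)
    for cmd in build_push_commands(snap, host, remote_db):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Calling `replicate("/var/lib/iot/data.db", "pi@node2", "/var/lib/iot/data.db")` from a timer or cron job would then push one snapshot per run (host and paths are hypothetical).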
Split Brain Healing: There is no automatic procedure that kicks in when the Master node comes back up after a failure. I am also not sure I really want one, for several reasons. At the moment, when Node1 (Master) goes down, Node2 (Slave) takes over and that is it. If Node1 gets back online it does NOT get promoted back to Master. If the Slave fails, the Master just keeps on doing its thing. A manual heal might be the preferred option, considering you would probably need to sync the databases yourself, unless you really do not care about a few seconds' worth of lost data. Will re-evaluate this later.
Heartbeat Monitoring: keepalived is deployed to provide the floating IP, which it does rather well. What I still need is to monitor the services and fail over when, for example, the Apache web server stops working on one of the nodes. No web server = BAD = IoT devices can't send any data to the database over the network.
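keepalived can run an external check script through a `vrrp_script` block and treats a non-zero exit status as a failed check. Here is a minimal sketch of such a check, assuming Apache answers plain HTTP on the node's own loopback address; the URL and timeout are placeholders, not values from my setup.

```python
import http.client
import urllib.error
import urllib.request

# An empty ProxyHandler makes sure a local health check never goes
# through a proxy configured in the environment.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def web_server_ok(url="http://127.0.0.1/", timeout=2.0):
    """Return True if the web server answers with a 2xx or 3xx status."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, HTTP 4xx/5xx or a garbled reply.
        return False


def exit_code(url="http://127.0.0.1/"):
    """Map the check result to a process exit status: 0 healthy, 1 failed."""
    return 0 if web_server_ok(url) else 1
```

To use it as a standalone check script, finish the file with `import sys` and `sys.exit(exit_code())`, then point a `vrrp_script` at it.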
Node Down Notifications: Currently you will not have a clue in the world whether one of the Raspberry Pi boards has failed. A simple email alert to make you aware of the event might prove rather useful, methinks. Or, in a smart house, maybe all the lights could turn red as a rather obvious indicator of what has happened (that would be rather neat).
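One possible shape for the email alert, as a sketch only. It assumes a mail relay is reachable from the node (localhost on port 25 here), and the addresses in the usage note are placeholders. Whatever notices the state change (a keepalived notify script, for instance) would call it with the node name and the new state.

```python
import smtplib
import socket
from email.message import EmailMessage


def build_alert(node, state, sender, recipient):
    """Build a short plain-text alert about a node changing state."""
    msg = EmailMessage()
    msg["Subject"] = f"[HA] {node} is now {state}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"Reported by {socket.gethostname()}: {node} entered state {state}."
    )
    return msg


def send_alert(msg, smtp_host="localhost", smtp_port=25):
    """Hand the message to an SMTP relay (host and port are assumptions)."""
    with smtplib.SMTP(smtp_host, smtp_port, timeout=10) as smtp:
        smtp.send_message(msg)
```

For example, `send_alert(build_alert("node1", "FAULT", "ha@example.org", "me@example.org"))` would send one mail when Node1 drops out.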
SQLite issues: When syncing the database on the Slave node things can get messy, and INSERT and UPDATE routines can fail (the ones on the Master can too, to be frank, but those you will see via the web interface). On the Slave you would never notice unless you went looking. Designing a way to let the user know when this happens would be a good idea for the general health and wellbeing of the HA setup, I think you would agree.
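A small sketch of how failed writes could be made visible instead of vanishing silently. The `WriteMonitor` class and the logger name are invented for illustration; the point is simply to catch `sqlite3.Error`, log it, and keep a count that something else (a status page, an alert) can read.

```python
import logging
import sqlite3

log = logging.getLogger("ha.sqlite")  # hypothetical logger name


class WriteMonitor:
    """Wrap a connection and count INSERT/UPDATE statements that fail."""

    def __init__(self, conn):
        self.conn = conn
        self.failed = 0

    def write(self, sql, params=()):
        """Run one statement; return True on success, False on failure."""
        try:
            with self.conn:  # commits on success, rolls back on error
                self.conn.execute(sql, params)
            return True
        except sqlite3.Error as exc:
            self.failed += 1
            log.error("write failed: %s | sql=%s params=%r", exc, sql, params)
            return False
```

A non-zero `failed` count after a sync run is then the signal that the Slave needs a closer look.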
Miscellaneous: One example would be to increase the interval at which keepalived checks the health of the nodes. At the moment I am running it at one second, but realistically five seconds would do fine as well.
Once I get these things implemented, it should make for a much more robust way to replicate SQLite over a network, one that should be good enough for the high availability heart of a smart home 🙂