I would like to have a discussion about best practices for a high availability production deployment of ERPNext. Besides our customers, who view or send orders, we also have two factories that depend on the system being up and running.
We currently have ERPNext (bench + redis + mariadb, etc.) running on a bare metal Ubuntu Linux server that is located in a datacenter. We do regular backups, but if there were a catastrophic server failure we would need to set up a new server and restore the files and the DB from our backups. The downtime would probably be about 1 hour if everything goes well. Instead, I would like to set up a highly available web service so that we can still run our company in case of a server failure or internet connection problems.
Our internet connection is unfortunately not 100% reliable. We have written a connector app which reads and writes information between ERPNext and our assembly machines. If the ERPNext server cannot be reached (e.g. the internet connection is down), our machines are down! We are mitigating this with a firewall with failover and two separate internet connections; however, last year both ISPs were down for about 3 hours during a thunderstorm. At our second facility we have had an outage of both providers because a truck crashed into the telephone lines, ripping down the DSL and fibre-optic connections.
MariaDB Cluster
- Set up a MariaDB cluster with 3 nodes (web, factory 1, factory 2)
- Install Bench on 3 bare metal machines (web, factory 1, factory 2)
- Install a load balancer that distributes between the 3 servers depending on availability, or just access the local servers by hostname, like www.company.com, factory1.company.com and factory2.company.com
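To make the load balancer idea a bit more concrete, here is a minimal sketch of the "depending on availability" part: try the servers in a preferred order and use the first one that answers. The hostnames are the ones from the list above; the port, the timeout, and the function names `tcp_probe` and `pick_server` are assumptions for illustration only, and a real setup would more likely use an existing load balancer with health checks.

```python
import socket
from typing import Callable, Optional, Sequence

# Preference order as seen from factory 1: local server first.
# Port 443 and the 2 second timeout are assumptions, not tested values.
SERVERS = ["factory1.company.com", "www.company.com", "factory2.company.com"]


def tcp_probe(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def pick_server(
    servers: Sequence[str],
    is_up: Callable[[str], bool] = tcp_probe,
) -> Optional[str]:
    """Return the first reachable server in preference order, or None."""
    for host in servers:
        if is_up(host):
            return host
    return None
```

The connector app at each site could call `pick_server(SERVERS)` with its own local server listed first, so it only goes over the internet when the local machine is down.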
A setup like this should be possible with bench and the framework. Does somebody have experience with a setup of two servers running bench and a MariaDB Galera Cluster? I want to avoid a Master-Slave setup. The downside would definitely be that we would need to run bench update manually on every server.
Another question would be how to synchronize the private and public files.
Docker
- Set up a node cluster online
- Set up 3 nodes in the availability zones (web, factory 1, factory 2)
- Separate the services that bench starts in docker
- Deploy the stack
We would have to make sure the containers in the different availability zones will work when cut off from each other.
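A small sketch of why the "cut off from each other" case needs care, assuming the cluster uses simple majority quorum. Galera does this by default with equal node weights; whether a Docker-based stack behaves the same depends on the orchestrator and database used, so treat that as an assumption. The function name `has_quorum` is made up for illustration.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True if this side of a network split holds a strict majority.

    reachable_nodes counts the node itself plus every peer it can still see.
    """
    if not 0 <= reachable_nodes <= cluster_size:
        raise ValueError("reachable_nodes must be between 0 and cluster_size")
    return 2 * reachable_nodes > cluster_size


# Three nodes: web, factory 1, factory 2.
CLUSTER_SIZE = 3

# Factory 1 loses its internet connection and only sees itself.
isolated_factory = has_quorum(1, CLUSTER_SIZE)

# Web and factory 2 can still see each other.
remaining_pair = has_quorum(2, CLUSTER_SIZE)
```

Under this rule the isolated factory node would not have a majority, so a plain 3-node majority cluster would not by itself keep a cut-off factory writable. That is exactly the case we would have to design for.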
Final Thought
I don’t want to jump on the Docker hype, and no, we are not Google, so we don’t need to scale for millions of users. I just want to think about how to have a system that keeps running in case of failure.
Regardless of our specific case, I think an HA setup of ERPNext that is scalable and survives a server failure without users noticing is an interesting discussion to be had. These are just some ideas; I would appreciate input from people who have similar experiences!