I am trying to get ERPNext working on my Kubernetes cluster. For persistence, I am using NFS and Longhorn (not really relevant to the issue I am going to show, I think).
I used no custom values AT ALL, just the simple commands shown in the Helm chart docs.
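For reference, from memory the commands were roughly the ones from the chart README; treat this as a sketch, since the release/namespace names are just what I used:
helm repo add frappe https://helm.erpnext.com
helm repo update
kubectl create namespace erpnext
helm install frappe-bench --namespace erpnext frappe/erpnext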
Describing the faulty pod (frappe-bench-erpnext-socketio-746c69bcd4-czjkl) gives:
Warning Unhealthy 6m28s (x3 over 6m48s) kubelet Liveness probe failed: dial tcp 172.16.0.84:9000: connect: connection refused
Normal Killing 6m3s kubelet Container socketio failed liveness probe, will be restarted
Warning Unhealthy 5m48s (x8 over 6m48s) kubelet Readiness probe failed: dial tcp 172.16.0.84:9000: connect: connection refused
Normal Started 5m47s (x2 over 6m58s) kubelet Started container socketio
Normal Pulled 5m38s (x3 over 6m59s) kubelet Container image "frappe/frappe-socketio:v14.13.0" already present on machine
Normal Created 5m38s (x3 over 6m59s) kubelet Created container socketio
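For reference, the container's own logs can be checked directly with:
kubectl logs -n erpnext frappe-bench-erpnext-socketio-746c69bcd4-czjkl -c socketio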
At the moment other pods are also restarting and faulty, but I am not going to show them now, because they are probably failing only because the previously mentioned one fails.
And after waiting about half an hour, the “steady” state of get pods looks like this:
Why did the conf-bench job fail? What's the pod log for the conf-bench pod?
Once conf-bench is successful, common_site_config.json and apps.txt will be created. That configures all other pods to pick up the proper hosts for Redis and the DB.
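For example, once the job finishes, you should be able to see those values with something like this (deployment/container names are guessed from your pod names above, the path is the standard bench layout):
kubectl exec -n erpnext deploy/frappe-bench-erpnext-worker-s -c short -- cat /home/frappe/frappe-bench/sites/common_site_config.json
It should contain keys like db_host, redis_cache, redis_queue and socketio_port.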
# kubectl logs -n erpnext -f frappe-bench-erpnext-conf-bench-20221103212650-rwjdx
Defaulted container "configure" out of: configure, frappe-bench-ownership (init)
failed to create fsnotify watcher: too many open files
I am not sure if the k3d environment is different from vanilla Kubernetes on Hetzner.
Probably to confirm that, I would need to deploy a completely fresh cluster and try to deploy ERPNext on it. To be honest, I don’t think this is going to make any difference, because my current environment is controlled and nothing unnecessary is deployed on it. Still, I will deploy a completely new K8s cluster and see if it works there.
Apparently this error only appears when using logs with -f on a “dead” container, so we can ignore it for now; without -f I do not get it.
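(If that fsnotify error ever shows up for a running container, the usual workaround seems to be raising the inotify limits on the node; the values below are just examples:
sysctl -w fs.inotify.max_user_instances=8192
sysctl -w fs.inotify.max_user_watches=524288)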
I have just finished a fresh ERPNext deployment on fresh nodes, and I still have issues. The current state looks like this:
All worker pods (the ones shown as failed) show the following (or very similar):
root@control-plane-01:~# kubectl logs -n erpnext frappe-bench-erpnext-worker-s-d4f7c8fbb-h2sfq
Defaulted container "short" out of: short, populate-assets (init)
15:13:12 Worker rq:worker:0dea130994db435bb0946302f797d00c.frappe-bench-erpnext-worker-s-d4f7c8fbb-h2sfq.7.home-frappe-frappe-bench:short: started, version 1.10.1
15:13:12 Subscribing to channel rq:pubsub:0dea130994db435bb0946302f797d00c.frappe-bench-erpnext-worker-s-d4f7c8fbb-h2sfq.7.home-frappe-frappe-bench:short
15:13:12 *** Listening on home-frappe-frappe-bench:short...
15:13:12 Cleaning registries for queue: home-frappe-frappe-bench:short
The scheduler pod shows no logs at all, just empty.
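(Since the container keeps restarting, maybe the previous instance logged something before it died; I can try something like this, the pod name being just a placeholder for the current one:
kubectl logs -n erpnext frappe-bench-erpnext-scheduler-<pod-id> -c scheduler --previous)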
The events in the describe output of the scheduler show:
Normal Pulling 17m kubelet Pulling image "frappe/erpnext-worker:v14.5.1"
Normal Pulled 17m kubelet Successfully pulled image "frappe/erpnext-worker:v14.5.1" in 30.178706426s
Normal Created 17m (x3 over 17m) kubelet Created container scheduler
Normal Started 17m (x3 over 17m) kubelet Started container scheduler
Warning BackOff 14m (x9 over 17m) kubelet Back-off restarting failed container
Normal Pulled 4m30s (x34 over 17m) kubelet Container image "frappe/erpnext-worker:v14.5.1" already present on machine
The events in the describe output of one worker show:
Normal Pulled 19m kubelet Successfully pulled image "frappe/erpnext-nginx:v14.5.1" in 16.703634324s
Normal Created 19m kubelet Created container populate-assets
Normal Started 19m kubelet Started container populate-assets
Normal Created 19m (x2 over 19m) kubelet Created container short
Normal Started 18m (x2 over 19m) kubelet Started container short
Warning BackOff 18m kubelet Back-off restarting failed container
Normal Pulled 93s (x57 over 19m) kubelet Container image "frappe/erpnext-worker:v14.5.1" already present on machine
I cannot see what the issue can be. @revant_one, am I missing anything?
I’ve not used Hetzner VMs for K8s setup. I’ve no idea.
Most of the time I’ve interacted with managed clusters and k3d for local testing/development.
I’ll need to interact with the cluster with kubectl to find out.
If you can let me access your cluster using kubectl I can try to find out what is going on.
I still have, hopefully, a final issue in this phase. When I try to access ERPNext over the ingress or kubectl port-forward, I still get a 404 error. Do I need to do any further steps? So apparently I can reach the ingress controller, but there is no backend on the ERPNext side?
By the way, why do I have an nginx pod in the erpnext namespace? Could that be the issue? Do I really need it, given that I am running my own ingress controller?
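One check I can think of (just a sketch; the service name, port and site name below are guesses/placeholders) is to port-forward to the chart's nginx service and send the site name as the Host header, since Frappe resolves sites by host:
kubectl port-forward -n erpnext svc/frappe-bench-erpnext 8080:8080
curl -H "Host: erp.example.com" http://localhost:8080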
UPDATE
I tried to create a site as shown here:
but I still get the same 404 error.
I sent you the kubeconfig (admin.conf) as a message.
As shown above in a message from @revant_one, the solution to the ingress issue was to apply the new-site job in the correct namespace, which I was not doing. Thank you!
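For anyone hitting the same thing, the fix boiled down to applying the rendered create-site job manifest with the right namespace, roughly like this (the file name is just whatever you saved the rendered job as):
kubectl apply -n erpnext -f create-new-site-job.yaml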
@revant_one Maybe you can post an answer with that content so that I can accept it as a solution.
I am just curious to know if it is “normal” that some pods keep restarting (and, after a lot of restarts, get recreated) until at some point they succeed. Apparently these pods keep restarting until the other pods are ready.
This whole process, with some unnecessary restarts, takes about 5 minutes. Can’t these pods wait until the other pods are ready? Or is it not an issue at all that they are just restarted/recreated until they eventually work?
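I imagine an explicit wait step could avoid this, e.g. an init container running a simple poll loop like the one below (the service name and port are just placeholders for whatever the chart creates):
until nc -z <redis-queue-service> 6379; do echo "waiting for redis"; sleep 2; done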
Here we can see that frappe-bench-erpnext-socketio-7fc67ccf9f-56dgs failed to start and was therefore recreated; the second copy is ready, while the original one failed.
Similar unnecessary initial restarts also happened to the workers.