ERPNext Helm chart on vanilla K8s does not work after fresh deployment

Hello,

I am trying to get ERPNext working on my Kubernetes cluster. For persistence I am using NFS and Longhorn (probably not relevant to the issue below, I think).

I used no custom values AT ALL, just the simple commands shown in the Helm chart's instructions.

The current state of the pods looks like this:

Every 1.0s: kubectl get pods -n erpnext    control-plane-01: Thu Nov  3 00:48:30 2022

NAME                                                   READY   STATUS             RESTARTS        AGE
frappe-bench-erpnext-conf-bench-20221103004411-tswbg   0/1     Init:0/1           0               3m36s
frappe-bench-erpnext-gunicorn-7799994665-q8vp4         0/1     Running            1 (41s ago)     3m36s
frappe-bench-erpnext-nginx-c68c85bd9-xpd7q             1/1     Running            1 (2m42s ago)   3m36s
frappe-bench-erpnext-scheduler-d97548495-rbwht         0/1     Running            1 (50s ago)     3m36s
frappe-bench-erpnext-socketio-746c69bcd4-czjkl         0/1     CrashLoopBackOff   2 (50s ago)     3m36s
frappe-bench-erpnext-worker-d-67fd779b5b-7hpc5         0/1     Running            0               3m36s
frappe-bench-erpnext-worker-l-6c7fb9c444-gmbgd         0/1     Init:0/1           0               3m36s
frappe-bench-erpnext-worker-s-5cf9556786-j9qx5         0/1     Init:0/1           0               3m36s
frappe-bench-mariadb-0                                 1/1     Running            0               3m36s
frappe-bench-redis-cache-master-0                      1/1     Running            0               3m36s
frappe-bench-redis-queue-master-0                      1/1     Running            0               3m36s
frappe-bench-redis-socketio-master-0                   1/1     Running            0               3m36s

Describing the faulty pod (frappe-bench-erpnext-socketio-746c69bcd4-czjkl) gives:

 Warning  Unhealthy       6m28s (x3 over 6m48s)  kubelet  Liveness probe failed: dial tcp 172.16.0.84:9000: connect: connection refused
  Normal   Killing         6m3s                   kubelet  Container socketio failed liveness probe, will be restarted
  Warning  Unhealthy       5m48s (x8 over 6m48s)  kubelet  Readiness probe failed: dial tcp 172.16.0.84:9000: connect: connection refused
  Normal   Started         5m47s (x2 over 6m58s)  kubelet  Started container socketio
  Normal   Pulled          5m38s (x3 over 6m59s)  kubelet  Container image "frappe/frappe-socketio:v14.13.0" already present on machine
  Normal   Created         5m38s (x3 over 6m59s)  kubelet  Created container socketio
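The liveness and readiness probes are simply dialing TCP port 9000 on the pod IP. A quick sanity check, using the IP and port from the events above, is to repeat that dial manually (a sketch; run it from a cluster node with netcat installed — it only tells you whether anything is listening yet):

```shell
# Manually repeat what the TCP probe does: dial port 9000 on the pod IP.
# IP and port are taken from the probe-failure events above.
nc -zv -w 2 172.16.0.84 9000
```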

Other pods are also restarting and unhealthy at the moment, but I am not showing them here, because they are probably failing as a consequence of the one above.

And after waiting about half an hour, the “steady” state of get pods looks like this:

frappe-bench-erpnext-conf-bench-20221103004411-tswbg   0/1     Error                    0               29m
frappe-bench-erpnext-gunicorn-7799994665-q8vp4         1/1     Running                  5 (10m ago)     29m
frappe-bench-erpnext-nginx-c68c85bd9-k6vlz             1/1     Running                  1 (12m ago)     19m
frappe-bench-erpnext-nginx-c68c85bd9-xpd7q             0/1     ContainerStatusUnknown   4 (21m ago)     29m
frappe-bench-erpnext-scheduler-d97548495-rbwht         1/1     Running                  7 (9m44s ago)   29m
frappe-bench-erpnext-socketio-746c69bcd4-8ndxl         0/1     CrashLoopBackOff         7 (4m43s ago)   21m
frappe-bench-erpnext-socketio-746c69bcd4-czjkl         0/1     ContainerStatusUnknown   2 (27m ago)     29m
frappe-bench-erpnext-worker-d-67fd779b5b-7hpc5         0/1     CrashLoopBackOff         7 (4m50s ago)   29m
frappe-bench-erpnext-worker-l-6c7fb9c444-gmbgd         0/1     ContainerStatusUnknown   6 (4m46s ago)   29m
frappe-bench-erpnext-worker-l-6c7fb9c444-gsxwb         0/1     CrashLoopBackOff         3 (21s ago)     3m5s
frappe-bench-erpnext-worker-s-5cf9556786-j9qx5         0/1     CrashLoopBackOff         9 (3m4s ago)    29m
frappe-bench-mariadb-0                                 1/1     Running                  1 (9m16s ago)   14m
frappe-bench-redis-cache-master-0                      1/1     Running                  0               29m
frappe-bench-redis-queue-master-0                      1/1     Running                  0               29m
frappe-bench-redis-socketio-master-0                   1/1     Running                  0               29m

I am not sure what the issue is, could anybody help me get it working?

Please let me know if you need more info (e.g., describe output for more pods).

Share kubeconfig or access.

Why did the conf-bench job fail? What’s the pod log for the conf-bench pod?

Once conf-bench is successful, common_site_config.json and apps.txt will be created. That configures all other pods to pick up the proper hosts for Redis and the DB.
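A sketch of how to check this, using the job and deployment names from the pod listing above (the timestamp suffix and the bench path `/home/frappe/frappe-bench` are assumptions based on the default image layout; adjust to your release):

```shell
# Fetch the configure job's logs (job name from the pod listing above;
# the timestamp suffix will differ per deployment):
kubectl -n erpnext logs job/frappe-bench-erpnext-conf-bench-20221103004411

# If the job completed, the shared sites volume should now contain the
# generated files; check through any running bench pod:
kubectl -n erpnext exec deploy/frappe-bench-erpnext-gunicorn -- \
  cat /home/frappe/frappe-bench/sites/common_site_config.json \
      /home/frappe/frappe-bench/sites/apps.txt
```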

# .kubeconfig

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: REDACTED
    server: https://10.0.0.10:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubernetes-admin
  name: kubernetes-admin@kubernetes
current-context: kubernetes-admin@kubernetes
kind: Config
preferences: {}
users:
- name: kubernetes-admin
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED

I uninstalled the ERPNext Helm release and redeployed it. The behaviour is inconsistent; I now get the following:

kubectl get pods -n erpnext    control-plane-01: Thu Nov  3 21:41:33 2022

NAME                                                   READY   STATUS             RESTARTS        AGE
frappe-bench-erpnext-conf-bench-20221103212650-rwjdx   0/1     Completed          0               13m
frappe-bench-erpnext-gunicorn-7799994665-s2gcd         1/1     Running            0               13m
frappe-bench-erpnext-new-site-20221103212650-42qb5     0/1     Completed          0               13m
frappe-bench-erpnext-nginx-c68c85bd9-68rrv             0/1     Running            1 (5m31s ago)   13m
frappe-bench-erpnext-scheduler-d97548495-crdvd         0/1     CrashLoopBackOff   6 (7m39s ago)   13m
frappe-bench-erpnext-socketio-746c69bcd4-vbr8h         0/1     CrashLoopBackOff   6 (7m48s ago)   13m
frappe-bench-erpnext-worker-d-67fd779b5b-dr9zh         0/1     Running            5 (7m8s ago)    13m
frappe-bench-erpnext-worker-l-6c7fb9c444-l5cxl         0/1     CrashLoopBackOff   5 (8m7s ago)    13m
frappe-bench-erpnext-worker-s-5cf9556786-q5k7q         0/1     CrashLoopBackOff   5 (7m10s ago)   13m
frappe-bench-mariadb-0                                 1/1     Running            1 (9m46s ago)   13m
frappe-bench-redis-cache-master-0                      1/1     Running            0               13m
frappe-bench-redis-queue-master-0                      1/1     Running            0               13m
frappe-bench-redis-socketio-master-0                   1/1     Running            0               13m
# kubectl logs -n erpnext -f frappe-bench-erpnext-conf-bench-20221103212650-rwjdx

Defaulted container "configure" out of: configure, frappe-bench-ownership (init)
failed to create fsnotify watcher: too many open files

I searched for this:

https://www.google.com/search?q=failed+to+create+fsnotify+watcher%3A+too+many+open+files

The first result, “Failed to create fsnotify watcher: too many open files - Drone Support - Harness Community”, suggests tweaking kernel settings.
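For reference, the commonly suggested tweak is raising the kernel's inotify limits on each node; this is a sketch of what that looks like (the specific values are illustrative, not a recommendation from this thread):

```shell
# Read the current inotify limits; the "too many open files" fsnotify error
# usually means one of these is exhausted on the node:
sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches

# Raising them (values are illustrative; run on each node, and persist
# via /etc/sysctl.d/ to survive reboots):
#   sudo sysctl -w fs.inotify.max_user_instances=1024
#   sudo sysctl -w fs.inotify.max_user_watches=524288
```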

I’m using the helm charts in production, and others are also using them. I can run the k3d setup like the one in tests any time and it works.

I am not sure if the k3d environment is different from vanilla Kubernetes on Hetzner.

To confirm that, I probably need to deploy a completely fresh cluster and try to deploy ERPNext on it. To be honest, I don’t think this is going to make any difference, because my current environment is controlled and nothing unnecessary is deployed on it. Still, I will try a completely new K8s cluster and see if it works there.

I will keep you updated. Thank you!

Apparently this error only appears when using logs with -f on a “dead” container, so we can forget about it now; without -f I do not get it.

I have just finished a fresh ERPNext deployment on fresh nodes, and I still have issues. The current state looks like this:

erpnext            frappe-bench-erpnext-conf-bench-20221105160501-ttsxw     0/1     Completed              0             20m
erpnext            frappe-bench-erpnext-gunicorn-869795cd57-8grrj           1/1     Running                0             20m
erpnext            frappe-bench-erpnext-nginx-cf6bdbc97-kblkx               1/1     Running                0             20m
erpnext            frappe-bench-erpnext-scheduler-7d75c9cbb7-zmfcv          0/1     CreateContainerError   3 (11m ago)   20m
erpnext            frappe-bench-erpnext-socketio-d4cb96f7f-jnjfq            1/1     Running                3 (12m ago)   20m
erpnext            frappe-bench-erpnext-worker-d-59b69ccc48-r6d4w           0/1     CreateContainerError   2 (11m ago)   20m
erpnext            frappe-bench-erpnext-worker-l-6cc4c8cd4d-l9srm           0/1     CreateContainerError   2 (11m ago)   20m
erpnext            frappe-bench-erpnext-worker-s-d4f7c8fbb-h2sfq            0/1     CreateContainerError   2 (11m ago)   20m
erpnext            frappe-bench-mariadb-0                                   1/1     Running                0             20m
erpnext            frappe-bench-redis-cache-master-0                        1/1     Running                0             20m
erpnext            frappe-bench-redis-queue-master-0                        1/1     Running                0             20m
erpnext            frappe-bench-redis-socketio-master-0                     1/1     Running                0             20m

All worker pods (the ones shown as failed) show the following (or very similar):

root@control-plane-01:~# kubectl logs -n erpnext frappe-bench-erpnext-worker-s-d4f7c8fbb-h2sfq
Defaulted container "short" out of: short, populate-assets (init)
15:13:12 Worker rq:worker:0dea130994db435bb0946302f797d00c.frappe-bench-erpnext-worker-s-d4f7c8fbb-h2sfq.7.home-frappe-frappe-bench:short: started, version 1.10.1
15:13:12 Subscribing to channel rq:pubsub:0dea130994db435bb0946302f797d00c.frappe-bench-erpnext-worker-s-d4f7c8fbb-h2sfq.7.home-frappe-frappe-bench:short
15:13:12 *** Listening on home-frappe-frappe-bench:short...
15:13:12 Cleaning registries for queue: home-frappe-frappe-bench:short

The scheduler pod shows no logs at all; its log is just empty.
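When the running container prints nothing, the previous (crashed) instance's output and its termination state are often still retrievable; a sketch, using the scheduler pod name from the listing above:

```shell
# Logs from the previous, crashed instance of the container:
kubectl -n erpnext logs frappe-bench-erpnext-scheduler-7d75c9cbb7-zmfcv --previous

# Exit code and reason of the last terminated container:
kubectl -n erpnext get pod frappe-bench-erpnext-scheduler-7d75c9cbb7-zmfcv \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
```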

The events in the scheduler’s describe output show:

 Normal   Pulling         17m                   kubelet  Pulling image "frappe/erpnext-worker:v14.5.1"
  Normal   Pulled          17m                   kubelet  Successfully pulled image "frappe/erpnext-worker:v14.5.1" in 30.178706426s
  Normal   Created         17m (x3 over 17m)     kubelet  Created container scheduler
  Normal   Started         17m (x3 over 17m)     kubelet  Started container scheduler
  Warning  BackOff         14m (x9 over 17m)     kubelet  Back-off restarting failed container
  Normal   Pulled          4m30s (x34 over 17m)  kubelet  Container image "frappe/erpnext-worker:v14.5.1" already present on machine

The events in one worker’s describe output show:

 Normal   Pulled          19m                 kubelet  Successfully pulled image "frappe/erpnext-nginx:v14.5.1" in 16.703634324s
  Normal   Created         19m                 kubelet  Created container populate-assets
  Normal   Started         19m                 kubelet  Started container populate-assets
  Normal   Created         19m (x2 over 19m)   kubelet  Created container short
  Normal   Started         18m (x2 over 19m)   kubelet  Started container short
  Warning  BackOff         18m                 kubelet  Back-off restarting failed container
  Normal   Pulled          93s (x57 over 19m)  kubelet  Container image "frappe/erpnext-worker:v14.5.1" already present on machine

I cannot see what the problem can be. @revant_one, am I missing anything?

I’ve not used Hetzner VMs for a K8s setup, so I’ve no idea.
I’ve mostly interacted with managed clusters, and with k3d for local testing/development.

I’ll need to interact with the cluster with kubectl to find out.

If you can let me access your cluster using kubectl I can try to find out what is going on.


Thank you for offering to access my cluster; actually I wanted to suggest that but did not know if you really have time for it. Thank you!

I did a manual restart of all pods in the erpnext namespace:

kubectl -n erpnext rollout restart deploy

And after that, all pods are working! I don’t know how, but does it mean that we have a race condition here? Anyway, this is a good data point!
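A variant of the restart above that also waits for each rollout to settle, rather than eyeballing `get pods` (a sketch; `--timeout` is an arbitrary choice):

```shell
# Restart every deployment in the namespace, then block until each rollout
# has finished (or fail after the timeout):
kubectl -n erpnext rollout restart deploy
for d in $(kubectl -n erpnext get deploy -o name); do
  kubectl -n erpnext rollout status "$d" --timeout=5m
done
```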

erpnext            frappe-bench-erpnext-conf-bench-20221105172621-8c5qg     0/1     Completed   0              97m
erpnext            frappe-bench-erpnext-gunicorn-648b895cb6-xj646           1/1     Running     0              146m
erpnext            frappe-bench-erpnext-nginx-9f87c68fd-k7s9p               1/1     Running     0              146m
erpnext            frappe-bench-erpnext-scheduler-859f9578c6-74fpw          1/1     Running     0              146m
erpnext            frappe-bench-erpnext-socketio-7dbb77668d-wf5z8           1/1     Running     0              146m
erpnext            frappe-bench-erpnext-worker-d-6c959b894f-fcdrq           1/1     Running     0              146m
erpnext            frappe-bench-erpnext-worker-l-5d5b54c6b5-fdgbc           1/1     Running     0              146m
erpnext            frappe-bench-erpnext-worker-s-866bfc95f5-mfr72           1/1     Running     0              146m
erpnext            frappe-bench-mariadb-0                                   1/1     Running     0              178m
erpnext            frappe-bench-redis-cache-master-0                        1/1     Running     0              178m
erpnext            frappe-bench-redis-queue-master-0                        1/1     Running     0              178m
erpnext            frappe-bench-redis-socketio-master-0                     1/1     Running     0              178m

I still have, hopefully, one final issue in this phase. When I try to access ERPNext over ingress or kubectl port-forward, I still get a 404 error. Do I need to do any further steps? Apparently I can reach the ingress controller, but there is no backend on the ERPNext side?

By the way, why do I have an nginx pod in the erpnext namespace? Could that be the issue? Do I really need it, given that I am running my own ingress controller?
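One thing worth checking for 404s: Frappe serves sites by hostname, so a request whose Host header does not match an existing site name typically gets a 404 even when the backend is healthy. A sketch of a direct test (the site name `erp.example.com` is a placeholder, and the service name is assumed to follow the release naming seen above):

```shell
# Forward the frappe service locally, then request it with an explicit Host
# header matching the site you created:
kubectl -n erpnext port-forward svc/frappe-bench-erpnext 8080:8080 &
sleep 2
curl -sS -H "Host: erp.example.com" http://127.0.0.1:8080/api/method/ping
kill %1
```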

UPDATE

I tried to create a site as shown here:

but I still get the same 404 error.

I sent you the kubeconfig (admin.conf) as a message.

Have you created an ingress for the site? helm/README.md at main · frappe/helm · GitHub

Yes, sure. I tried using both an Ingress and port-forward, and even tried accessing the pod IP from inside the cluster; all give the 404 error.

I did not create the ingress manually; I just enabled it in values.yaml.

As @revant_one pointed out, the solution to the ingress issue was to apply the new-site job in the correct namespace, which I was not doing. Thank you!

@revant_one Maybe you can post an answer with that content so that I can accept it as a solution.

kubectl -n erpnext apply -f job.yml

Create the job resource in the correct namespace.
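A quick way to double-check before creating it for real (a sketch; `job.yml` is the new-site job manifest from the helm chart README):

```shell
# Client-side dry run: validates the manifest without creating anything;
# the -n flag puts the job in the release's namespace.
kubectl -n erpnext apply -f job.yml --dry-run=client

# Apply for real, then confirm the job shows up in that namespace:
kubectl -n erpnext apply -f job.yml
kubectl -n erpnext get jobs
```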


Thanks a lot, @revant_one!

I am just curious whether it is “normal” that some pods keep restarting (and after many restarts get recreated) until at some point they succeed. Apparently these pods keep restarting until the other pods are ready.

This whole process, with some unnecessary restarts, takes about 5 minutes. Can’t these pods wait until the other pods are ready? Or is it not an issue at all that they are restarted/recreated until they eventually work?

erpnext            frappe-bench-erpnext-conf-bench-20221107212541-dkcfc     0/1     Completed                0               5m35s
erpnext            frappe-bench-erpnext-gunicorn-74df6f654f-ngkqk           1/1     Running                  0               5m35s
erpnext            frappe-bench-erpnext-new-site-20221107212541-4nh6l       0/1     Completed                0               5m35s
erpnext            frappe-bench-erpnext-nginx-dbd56cf87-45mjg               1/1     Running                  0               5m35s
erpnext            frappe-bench-erpnext-scheduler-759f6dbfb9-m8qjq          1/1     Running                  4 (3m39s ago)   5m35s
erpnext            frappe-bench-erpnext-socketio-7fc67ccf9f-56dgs           0/1     ContainerStatusUnknown   5 (3m43s ago)   5m35s
erpnext            frappe-bench-erpnext-socketio-7fc67ccf9f-qqrpd           1/1     Running                  0               60s
erpnext            frappe-bench-erpnext-worker-d-64d98c9cc6-gp55b           1/1     Running                  5 (64s ago)     5m35s
erpnext            frappe-bench-erpnext-worker-l-69fcd768d-d4tfx            1/1     Running                  4 (62s ago)     5m35s
erpnext            frappe-bench-erpnext-worker-s-5d749d4854-sx2bq           1/1     Running                  4 (65s ago)     5m35s
erpnext            frappe-bench-mariadb-0                                   1/1     Running                  1 (88s ago)     5m35s
erpnext            frappe-bench-redis-cache-master-0                        1/1     Running                  0               5m35s
erpnext            frappe-bench-redis-queue-master-0                        1/1     Running                  0               5m35s
erpnext            frappe-bench-redis-socketio-master-0                     1/1     Running                  0               5m35s

Here we can see that frappe-bench-erpnext-socketio-7fc67ccf9f-56dgs failed to start and was recreated; the second copy is ready, while the original one failed.

Similar unnecessary initial restarts also happened to the workers.