I am running ERPNext on Kubernetes with an external MySQL server and Redis. I have monitored MySQL performance metrics, and there are no signs of issues.
To investigate further, I integrated OpenTelemetry for tracing. As shown in the attached screenshot, Redis and MySQL performance appears to be fine. However, I’m experiencing intermittent slow responses when calling a single backend endpoint.
For example, in Jaeger traces, a simple login request sometimes takes an unexpectedly long time, despite no apparent database or Redis bottlenecks.
What else can I check? Could cache expiration be causing slow page loads when the cache is invalidated? Any insights or debugging tips would be appreciated.
I have a dedicated application for telemetry and have configured OpenTelemetry accordingly.
The main issue I’m facing is that when no one has used the app for a while, the first page load takes an extremely long time. Additionally, I suspect that when the cache is invalidated, page loads become significantly slower.
From the OpenTelemetry traces, Redis and MySQL respond within milliseconds, yet the overall page load time remains high.
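One way to put a number on that gap is to subtract the time covered by the Redis/MySQL child spans from the root span of the request. The sketch below is only an illustration: the `Span` class, the span names, and the timings are made up, and a real trace would have to be exported from Jaeger first. Whatever remains is time spent somewhere the tracing does not see (Python code, template rendering, network hops, waiting for a worker, and so on).

```python
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical minimal span record; real trace exports carry more fields.
    name: str
    start_ms: float
    end_ms: float


def uncovered_ms(root, children):
    """Return the time inside `root` that no child span covers.

    Child spans may overlap each other, so the intervals are merged
    while walking them in start order.
    """
    intervals = sorted(
        (max(c.start_ms, root.start_ms), min(c.end_ms, root.end_ms))
        for c in children
    )
    covered = 0.0
    cursor = root.start_ms
    for start, end in intervals:
        start = max(start, cursor)  # skip the part already counted
        if end > start:
            covered += end - start
            cursor = end
    return (root.end_ms - root.start_ms) - covered


# Made-up numbers shaped like the situation described above:
# an 8 s request whose DB and Redis spans only take milliseconds.
root = Span("POST /api/method/login", 0, 8000)
children = [
    Span("redis GET", 10, 12),
    Span("mysql SELECT", 20, 45),
    Span("mysql SELECT", 40, 60),
]
print(f"unaccounted: {uncovered_ms(root, children):.0f} ms")
```

If nearly all of the request time ends up unaccounted, the existing spans cannot explain the delay and more instrumentation (or a look outside the application) is needed.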
Our Kubernetes cluster performs well for other applications across the company, and this issue occurs only in ERPNext. We also see the same problem in the test environment.
What could be causing this? Are there specific areas I should investigate to optimize the initial load time?
I can see it’s 8 sec. What is taking those 8 sec? Is it a DB query, a Redis call, or Jinja rendering?
I’ve seen Frappe apps (without ERPNext) run at hyperscale. The team that built those apps made sure no request took more than 3 sec. They did it using traces and the basics Ankush discussed during the v14 release (a YouTube video is available).
You need the internal traces to decide whether the code/queries need changes or the setup/configuration needs to be improved.
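If the traces do not go deep enough, a generic way to get an internal breakdown of a single call is Python's standard-library profiler. This is a minimal sketch, not Frappe's own tooling: `handle_login` and `slow_render` are hypothetical stand-ins for whatever code path is slow, and `profile_call` is a helper made up for this example.

```python
import cProfile
import io
import pstats
import time


def profile_call(func, *args, top=10, **kwargs):
    """Run `func` under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def slow_render():
    # Hypothetical stand-in for an expensive step (e.g. template rendering).
    time.sleep(0.05)
    return "rendered"


def handle_login():
    # Hypothetical stand-in for the request handler being investigated.
    slow_render()
    return "ok"


result, report = profile_call(handle_login)
print(report)
```

The report shows which functions the time accumulates in, which helps decide between changing code/queries and changing the setup.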
After some really long debugging sessions, we found that we had latency between our Kubernetes nodes. After fixing the network problems, our instance became much faster (around 10x).
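For anyone who wants to check for something similar, here is a rough sketch that times TCP connects from inside a pod to another service. It is not how we found the problem, only an illustration: connect time is a coarse proxy for network latency, and the host name and port in the usage comment are hypothetical.

```python
import socket
import statistics
import time


def tcp_connect_latency_ms(host, port, samples=5, timeout=3.0):
    """Time `samples` TCP connects to host:port and return them in ms."""
    results = []
    for _ in range(samples):
        started = time.perf_counter()
        # Open the connection and close it again right away.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        results.append((time.perf_counter() - started) * 1000)
    return results


def summarize(latencies_ms):
    """Return min / median / max of the measured latencies."""
    return {
        "min": min(latencies_ms),
        "median": statistics.median(latencies_ms),
        "max": max(latencies_ms),
    }


# Usage from inside a pod (hypothetical service name and port):
#   print(summarize(tcp_connect_latency_ms("redis-cache.default.svc", 6379)))
```

Running it from pods on different nodes against the same target and comparing the medians gives a first hint whether one node or one path is slower than the rest.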