API deadlock in Production but not in Development! This is driving me INSANE?

If the invoice number provided is correct, rslt will be your typical API response to a call like:

curl -X GET 'http://erp.erpnext.host:8000/api/resource/Sales%20Invoice/001-002-000002683' -H 'Authorization: token c264badbadbad2bfe:bd8badbadbad074'

This happens whether or not the deadlock occurs!!!

When launching with supervisorctl start all, the process completes successfully, but only after a 120 second delay.

When launching with bench start and node ./bin/www.js, the process completes successfully with a +/- 1 second delay.

Does the curl work without delay under supervisor?

Meaning, after 120 seconds you get the correct result inside rslt?


Mind sharing your supervisor config here?

Checking that right now.

Yes. That is what happens.

In order to “equalize” the comparison, I created and tested with a Procfile that launches NodeJs the same way that supervisorctl launches it, so…

Procfile


redis_cache: redis-server config/redis_cache.conf
redis_socketio: redis-server config/redis_socketio.conf
redis_queue: redis-server config/redis_queue.conf

web: bench serve --port 8000

socketio: /usr/bin/node apps/frappe/socketio.js
my_svc: /home/erpdev/frappe-bench/apps/my_app/services/my_svc/go.sh

watch: bench watch

schedule: bench schedule
worker_short: bench worker --queue short 1>> logs/worker.log 2>> logs/worker.error.log
worker_long: bench worker --queue long 1>> logs/worker.log 2>> logs/worker.error.log
worker_default: bench worker --queue default 1>> logs/worker.log 2>> logs/worker.error.log

supervisor.conf

; Notes:
; priority=1 --> Lower priorities indicate programs that start first and shut down last
; killasgroup=true --> send kill signal to child processes too

[program:frappe-bench-my_svc]
command=/home/erpdev/frappe-bench/apps/my_app/services/my_svc/go.sh
priority=3
autostart=true
autorestart=true
stdout_logfile=/home/erpdev/frappe-bench/logs/my_svc.log
stderr_logfile=/home/erpdev/frappe-bench/logs/my_svc.error.log
user=erpdev
directory=/home/erpdev/frappe-bench/apps/my_app/services/my_svc


[program:frappe-bench-frappe-web]
command=/home/erpdev/frappe-bench/env/bin/gunicorn -b 127.0.0.1:8000 -w 1 -t 120 frappe.app:application --preload
               :               :               :               : 
               :               :               :               : 

They give identical results, but the latter introduces a 120 second delay.

Ok.
Steps I took:

  1. Try the curl command. I get a response in less than a second.
  2. Initiate the process from the client. No apparent activity. Logs show NodeJs waiting on ERPNext API.
  3. Try the curl command again. It hangs, waiting (timed in the probe below).
  4. 120 seconds later, the curl command and the ERPNext client get correct responses simultaneously!!!
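For reference, step 3 can be timed directly while the NodeJs service is waiting, reusing the request from the top of the thread (same redacted token); this is only a probe, not a fix:

time curl -sS -X GET 'http://erp.erpnext.host:8000/api/resource/Sales%20Invoice/001-002-000002683' -H 'Authorization: token c264badbadbad2bfe:bd8badbadbad074'

With -w 1 the real time comes out close to the 120 second limit; with more workers it drops back to about a second.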

I set them to 8, instead of 1 :slight_smile:

It made no difference at all.

Change -w 1 into
-w 2

Or 3 or 5 or 8
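In a stock bench install that means editing the gunicorn line in supervisor.conf and restarting the web program; a rough sketch, assuming the file lives at the usual config/supervisor.conf path and is linked into /etc/supervisor/conf.d/:

sudo sed -i 's/-w 1 /-w 4 /' /home/erpdev/frappe-bench/config/supervisor.conf
sudo supervisorctl reread
sudo supervisorctl update
sudo supervisorctl restart frappe-bench-frappe-web

Note that bench setup supervisor regenerates this file, so a value edited by hand can be overwritten later.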


J christ on hot buttered toast! You’re a genius!

Now, I have to understand how that got in there, 'coz looking at a backup of a different site configuration I see that it has -w 4.

My long shot guess is that when you initially set up ERPNext, it created only 1 worker based on your machine's core specification. My statement may be completely incorrect.
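If that is what happened, regenerating the config should show what bench computes for this machine; a quick check, assuming the standard bench layout:

cd /home/erpdev/frappe-bench
bench setup supervisor
grep -- '-w ' config/supervisor.conf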


Yeah, I’m killing the machine right now and rerunning my installation scripts. It’s a pay by the month KVM virtual host “in da cloud”. I’ll be very displeased if it’s single threaded or something.

I’ll let you know.

Your problem is in the parameter -t 120. This is too short and your workers get killed before they have time to respond. Here is the gunicorn manual on this:

https://docs.gunicorn.org/en/stable/settings.html

timeout

  • Command line: -t INT or --timeout INT
  • Default: 30

Workers silent for more than this many seconds are killed and restarted.

Generally set to thirty seconds. Only set this noticeably higher if you’re sure of the repercussions for sync workers. For the non sync workers it just means that the worker process is still communicating and is not tied to the length of time required to handle a single request.

I think the reason why you get a response after 120 seconds is in -t 120.

-t 6000 makes the worker timeout around 1.6 hours. This may be a high number. If -t 6000 works, you may try lowering the timeout.

I have not yet tested the statements below.

Since there is only 1 worker, and the default worker type is sync, the curl request (step 3) has to wait for the single worker; the request holding that worker expires at 120 seconds, and only then does the curl request get executed. So the number of workers needs to be raised.

Yes. The point of using gunicorn is to have workers in production mode. The issue here is that, after receiving the task, the worker passes it to the server (ERPNext python) and waits. Then it gets caught by the timeout. Since there is only 1 worker, you have to wait for that worker to be timed out before you can get the answer back, after 120 seconds.

So, the key is to increase the -t timeout parameter to give it enough time.

You see this happen during the ERPNext version 12 start-up setup. bench start development mode does not get the request timeout issue. However, production mode (bench restart) gets request timeout problems, which can be solved by increasing the http_timeout parameter using the bench command.
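For example (untested here, and assuming bench config http_timeout writes the value into common_site_config.json as documented), something along these lines should raise the timeout in production; 600 is just a trial value:

cd /home/erpdev/frappe-bench
bench config http_timeout 600
bench setup supervisor
sudo supervisorctl reread
sudo supervisorctl update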

I don't understand why you say this.

With -w 1 I had a deadlock that took 120 seconds to resolve.
With -w 4 I no longer have deadlock and everything processes in about 1 second.

Why would I want to stay with a single worker (-w 1) and increase the deadlock timeout to 1.6 hours?

It seems obvious that the root cause is the attempt at two-way communication with only one worker. The “key” was to increase the number of workers from 1 to 4.

Ok good. The recommended number of gunicorn workers is number_of_cores * 2 + 1.
So for a 4 core CPU the recommended setting is 9, and for a 2 core CPU it is 5.
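On Linux the suggested count can be computed straight from the core count, for example:

echo $(( $(nproc) * 2 + 1 ))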

With regard to the -t setting, the reason you had to wait 120 seconds when you had only one worker is that gunicorn killed the worker that was waiting for the API server response.

I think (here I am guessing) that to find a precise timeout setting, you can try raising -t until you no longer have a timeout problem with only one worker. Once you have that precise timeout setting, you can increase your gunicorn workers. Then you have an efficient system that does not kill workers that are just about to receive the API server response.
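If you want to run that experiment, the quickest way is probably to run gunicorn by hand with a single worker and a larger trial timeout; a sketch, assuming the standard bench layout where gunicorn runs from the sites directory, and with 600 as an arbitrary starting value:

sudo supervisorctl stop frappe-bench-frappe-web
cd /home/erpdev/frappe-bench/sites
../env/bin/gunicorn -b 127.0.0.1:8000 -w 1 -t 600 frappe.app:application --preload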