[FYI/"Workable"] Stock Entry (STE-) submit timeouts in Production

trentmu · March 24, 2019, 2:01pm

I previously created this post. This is a continuation, and possibly/hopefully an improvement - on that one.

I use…
ERPNext: v10.1.81 (898977b) (v10.x.x)
Frappe Framework: v10.1.71-8 (92dc1e0) (v10.x.x)
on a Debian stretch 9.x manually installed setup - i7 with 16GB RAM on an SSD
The DB parameters are “optimised” as much as possible within my knowledge for performance
The hostnames are in the /etc/hosts file to ensure it isn’t a DNS issue
The “user” PC is a debian LXDE i5 with 16GB RAM using chromium browser (behaviour is same in FF)

Problem:

When I submit certain (weirdly specific) STE requests, they would timeout. No error…just a timeout, and the doctype doesn’t get submitted.

Method:

I changed the timeouts, increased all sorts of parameters and retried lots of times.

Result:

A workable workaround which (for my scenario) is adequate.
- Not ideal, but OK.

Observations:

Although I now have 8x gunicorn threads, only ONE seems to be doing any work.
The CPU is an i7 - only 1 of its cores appears to be working (~90% in top, but in the GUI it shows ~10-13%)
As I submit more STE-#'s, it does appear to get marginally quicker on each subsequent submit
- possibly it still has cached copies of some of the code or something??
While the mem usage is relatively low, the response is also low.
- eg. when the in use RAM is ~3GB for the overall system, there are a lot more timeouts.
- When the overall system is >5GB, there are less timeouts, and the submits are (quite a lot) quicker?
- If I set up several tabs on the browser, each “ready” to save/submit, and then submit the next one the moment the done “ting” sounds, then there are less timeouts too
  - this also leads me to believe that there is a cache effect somewhere
The timeout message appears after 2mins - no matter what I use for the -t parameter??

Adjusted parameters via

nano ~/frappe-bench/config/supervisor.conf

Increased the number of worker threads to 8 (from default of 2)
Increased timeout to 360s (from default of 60s. I used 360s initially, and this worked on some STE’s but there is one particular BOM type that still fails, so I upped it to 480[fail], 600[fail] and eventually what worked was 900 for those)
Changed listener to 0.0.0.0 (from default of 127.0.0.1) to ensure that I access whatever port feels like listening, just in case of DNS type annoyances as outlined in the one section of the old post…

and updated/reloaded it with

sudo supervisorctl reread
sudo systemctl restart supervisor

I now have

ps -ef | grep guni

/home/frappe/frappe-bench/env/bin/python /home/frappe/frappe-bench/env/bin/gunicorn -b 0.0.0.0:8000 -w 8 -t 360 frappe.app:application --preload --limit-request-line 0

This allows me to (albeit slowly) submit them correctly. So, to summarise, what is working is as follows:

Save/Submit the STE
When it times out…just LEAVE IT untouched…it continues in the BG
If the timeouts are adequate, then it will finish its submit process - as long as you don’t mess with it.
The CPU utilisation will remain at the ~10%
- depends what you have in there - (i7 ~10%, i5 ~35% etc) until the process finishes or properly times out or bombs
- If the CPU drops back to very low (ie <5%) and the screen doesn’t yet show “Submitted”, then the timeout was NOT adequate, and the process is incomplete
  - you’ll need to increase the timeouts further, and restart the supervisor processes like this
    - nano ~/frappe-bench/config/supervisor.conf and increase the -t 60 to a higher number, like say 900 for the horrendously slow stuff
    - sudo systemctl restart supervisor
      - I found that using this
        ⇒ supervisorctl reread
        ⇒ supervisorctl stop all
        ⇒ supervisorctl start all
        
        did not load the new/adjusted parameters, while restarting the entire process did
The “workable” submit - which eventually finished after ~13mins of nothingness

ESV%20-%20finally%20submitted%20-%20Screenshot%20from%202019-03-24%2016-02-59410×576 26.2 KB
- The 3rd submit attempt took “only” 6mins to complete, and the 4th one took only 5mins, so definitely got much faster

I do NOT know why this happens, nor how to actually fix it.

hopefully someone who knows what’s what will see these notes/observations, have a eureka moment and change something in the mainstream code which can address/fix it properly
- I have looked at Submit Stock Entry taking too long time which describes a similar problem, in that the transactions are backdated, but I am not using a script like in that post, and don’t know how to do such things (yet).
  - There is an observation made in that same post about the fact that STE is uniquely slow
  - The post is from 2016, so I’m guessing that things in the code have changed - but apparently not quite enough just yet

Here are some pretty pictures to show the observations

CPU on server during a typical submit
Timeout on submitting PC screen - if this happens, just leave it alone and it’ll hopefully finish the BG submit

ERPNext%20STE%20submit%20001%20-%20Screenshot%20from%202019-03-24%2013-52-151059×514 36.4 KB
IF the CPU on server during the submit+timeout drops before you get the “Submitted” message with a “ting”, then your timeout is inadequate.
Submit (eventually/successfully) despite timeout if you left it running anyway

ERPNext%20STE%20submit%20-%20(succesful%20BG%20submit)%20-%20Screenshot%20from%202019-03-24%2014-02-351398×468 43.7 KB

Hopefully this will be helpful. Comments/clues/help welcomed if there is a better way.

clarkej · April 13, 2019, 4:09pm

You have dug around lots and uncovered many prospective parameters at play -

This here Performance issue with v10 pos notes MySQL analysis tools documented here:

https://dev.mysql.com/doc/refman/8.0/en/show-processlist.html
https://dev.mysql.com/doc/refman/8.0/en/sys-processlist.html
https://dev.mysql.com/doc/refman/8.0/en/thread-information.html
https://dev.mysql.com/doc/refman/8.0/en/threads-table.html

They may help learn precisely what query or queries are spending their time that may point you to the holdup?

frappe@ubuntu:~/frappe-bench$ mysql -u root -p
Enter password: 
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 348420
Server version: 10.2.13-MariaDB-10.2.13+maria~xenial-log mariadb.org binary distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> SELECT * FROM performance_schema.threads\G
Empty set (0.08 sec)

MariaDB [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| 1bd3e0294da19198   |
| d56cb677eaab3383   |
| information_schema |
| mysql              |
| performance_schema |
+--------------------+
5 rows in set (0.03 sec)

MariaDB [mysql]> use 1bd3e0294da19198;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [1bd3e0294da19198]> SELECT * FROM performance_schema.threads\G
Empty set (0.00 sec)

MariaDB [1bd3e0294da19198]> show processlist;
+--------+-------------+-----------+------------------+---------+------+--------------------------+------------------+----------+
| Id     | User        | Host      | db               | Command | Time | State                    | Info             | Progress |
+--------+-------------+-----------+------------------+---------+------+--------------------------+------------------+----------+
|      1 | system user |           | NULL             | Daemon  | NULL | InnoDB purge coordinator | NULL             |    0.000 |
|      3 | system user |           | NULL             | Daemon  | NULL | InnoDB purge worker      | NULL             |    0.000 |
|      2 | system user |           | NULL             | Daemon  | NULL | InnoDB purge worker      | NULL             |    0.000 |
|      4 | system user |           | NULL             | Daemon  | NULL | InnoDB purge worker      | NULL             |    0.000 |
|      5 | system user |           | NULL             | Daemon  | NULL | InnoDB shutdown handler  | NULL             |    0.000 |
| 348420 | root        | localhost | 1bd3e0294da19198 | Query   |    0 | init                     | show processlist |    0.000 |
+--------+-------------+-----------+------------------+---------+------+--------------------------+------------------+----------+
6 rows in set (0.04 sec)

MariaDB [1bd3e0294da19198]> SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST;
+--------+-------------+-----------+------------------+---------+------+--------------------------+----------------------------------------------+---------+-------+-----------+----------+-------------+---------------+----------+----------------------------------------------+------+
| ID     | USER        | HOST      | DB               | COMMAND | TIME | STATE                    | INFO                                         | TIME_MS | STAGE | MAX_STAGE | PROGRESS | MEMORY_USED | EXAMINED_ROWS | QUERY_ID | INFO_BINARY                                  | TID  |
+--------+-------------+-----------+------------------+---------+------+--------------------------+----------------------------------------------+---------+-------+-----------+----------+-------------+---------------+----------+----------------------------------------------+------+
| 348420 | root        | localhost | 1bd3e0294da19198 | Query   |    0 | Filling schema table     | SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST | 9.992   |     0 |         0 | 0.000    |       85296 |             0 |  3039715 | SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST | 2725 |
|      5 | system user |           | NULL             | Daemon  |    0 | InnoDB shutdown handler  | NULL                                         | 0.000   |     0 |         0 | 0.000    |       24360 |             0 |        0 | NULL                                         | 2543 |
|      4 | system user |           | NULL             | Daemon  |    0 | InnoDB purge worker      | NULL                                         | 0.000   |     0 |         0 | 0.000    |       24360 |             0 |        0 | NULL                                         | 2539 |
|      2 | system user |           | NULL             | Daemon  |    0 | InnoDB purge worker      | NULL                                         | 0.000   |     0 |         0 | 0.000    |       24360 |             0 |        0 | NULL                                         | 2537 |
|      3 | system user |           | NULL             | Daemon  |    0 | InnoDB purge worker      | NULL                                         | 0.000   |     0 |         0 | 0.000    |       24360 |             0 |        0 | NULL                                         | 2538 |
|      1 | system user |           | NULL             | Daemon  |    0 | InnoDB purge coordinator | NULL                                         | 0.000   |     0 |         0 | 0.000    |       24360 |             0 |        0 | NULL                                         | 2536 |
+--------+-------------+-----------+------------------+---------+------+--------------------------+----------------------------------------------+---------+-------+-----------+----------+-------------+---------------+----------+----------------------------------------------+------+
6 rows in set (0.02 sec)

This Slow Query Log utility may shed light on the problem https://mariadb.com/kb/en/library/slow-query-log-overview/

Just one of your 8x gunicorn workers does the work, presumably that is all that is required to respond to the front end HTTP requests and other housekeeping tasks.

clarkej · April 13, 2019, 5:22pm

This Record Lock Issue Version 4 lists the output from the query SHOW ENGINE INNODB STATUS\G that apparently concerns a record locking issue - perhaps look into that in your case - you stated one of your BOMs was particularly slow.