Deadlock issue on Frappe v14

sinhpntech · August 26, 2023, 3:49am

Hello,

I have encountered issues with Deadlocks and Lock Wait Timeout Exceeded in ERPNext.

We have multiple systems operating in the production environment, such as POS, Sales, and more. We are building an intermediate system to synchronize data to ERPNext, with the goal of eventually transitioning the entire system to utilize ERPNext. However, our system is complex and consists of multiple components, making it necessary to transition one part at a time. Thus, we have a service in place to synchronize data from our existing system to ERPNext, which runs at 3 am every day.

We are currently facing problems related to Deadlocks and Lock Timeout errors. Our synchronization service retrieves data from the running system and creates documents in ERPNext. Specifically, we are synchronizing data related to Stock Entry. Sales data from our POS, Sales, and other systems are synchronized to ERPNext. Everything works fine until this synchronization service is executed in the production environment. Each day, we have around 2-3 thousand orders being processed. When the synchronization service processes around 200-300 documents, Deadlock or Lock Timeout errors start occurring. We have tried restarting the bench, but the same errors persist upon reattempting the process.

Currently, we are using the Rest API to create documents. We have attempted to use the api/resource with the frappe.client.insert_many command to create multiple documents, and we’ve also tried the api/resource/Stock%20Entry to create individual documents, but the same errors occur.

What can be done to improve and resolve this situation? Thank you for taking the time to assist me. It would be greatly appreciated if we could address and rectify this issue.

This is our Frappe ERPNext versions:

#### Installed Apps

**ERPNext:** v14.36.0 (version-14)
**Frappe Framework:** v14.46.0 (version-14)
**Frappe Helper:** v0.3.3 (master)
**Frappe HR:** v14.10.2 (version-14)
**Restaurant:** v1.3.7 (master)

szufisher · August 26, 2023, 7:30am

if possible, try to customize form, Stock Entry, uncheck Track Changes.

szufisher · August 26, 2023, 7:40am

also please override the below method

refer to this app https://gitee.com/yuzelin/zelin_am/blob/master/zelin_am/overrides.py

class CustomStockEntry(StockEntry):
def validate_qty(self):
pass

here the similar case https://gitee.com/yuzelin/erpnext-chinese-docs/issues/I6SCJQ, unfortunately in Chinese, but the code and setting changes in English, maybe you can use online translation tools.

sinhpntech · August 26, 2023, 8:41am

Thank you so much @szufisher

I see that you are skip validate qty. Can you please explain this solution? How can It works?

Thank you again.

szufisher · August 26, 2023, 10:17am

the validate qty function does nothing useful, so in the app, just simply override the function with a pass statement.

sinhpntech · August 26, 2023, 4:16pm

Thank you so much. Let’s me try it

brian_pond · August 26, 2023, 7:27pm

On version 13, I have experienced a -ton- of problems with SQL Locks, Deadlocks, Timeouts. I’ve probably spent a few hundred hours working on these topics.

This is extremely difficult work. I will try to offer some advice, but it’s hard to explain without writing a lot more.

Problem #1 - Missing/Bad Indexes

ERPNext is not fully optimized for performance. Sometimes, SQL indexes are missing that should exist. Other times, the Python code is doing inefficient or incorrect logic.

When Locks/Deadlocks/Timeouts happen, try to capture the related SQL queries.

Configure your MariaDB server to use its Slow Query Log.
Try to see what’s happening in realtime, using SQL statements like SHOW FULL PROCESSLIST;

If you can find the queries causing the problems, you can take action. Such as optimizing indexes, or improving the Python code. The SQL command 'EXPLAIN' can help you determine if a query needs index optimization.

Problem #2 - Concurrency can produce Deadlocks

Many ERPNext tasks involve a lot of SQL tables. For example, Submitting a 'Delivery Note or ‘Sales Invoice’ can interact with SQL tables (DocTypes) like:

Item
Customer
Sales Order and Sales Order Item
Delivery Note and Delivery Note Item
Sales Invoice and Sales Invoice Item
Sales Taxes and Charges
GL Entry
Stock Ledger Entry (this table is especially challenging)

If you are Submitting these documents sequentially? Normally that is okay. But what happens if you try to post multiple documents concurrently? (for example, by using background Workers)

What can sometimes happen is a Deadlock. Two or more processes/threads trying to CRUD the same SQL records, at the same time. Neither can complete until the other releases locks. So they deadlock.

Solving this is extremely challenging. It took me a long while to gain control over this:

In many cases I tried to improve the performance by editing Python, SQL, or improving indexes. The faster the process, the less time SQL objects are locked.
If my task did not need a full rollback? Then I would frappe.db.commit() where possible, to remove the locks earlier.
I introduced my own “Safety Locking” mechanism:
- First, I changed the MariaDB transaction isolation level to “READ COMMITTED”
- Next, at the beginning of each risky Python function, I manually lock all the records I need.

By manually locking at the beginning, I force all other concurrent transactions to “wait” for locks to release. For example, Invoice #2 may have to wait a few seconds until Invoice #1 releases some locks.

Yes, this reduces concurrency (not ideal). But helps prevent deadlocks (really great). Ever since I introduced this concept into my code, Deadlocks are very, very rare in my environments.

Problem #3 - MariaDB doesn’t handle Locks gracefully

When MariaDB encounters long-running queries, or locks, the database tables can easily get “stuck” in 2 different states:

“Creating sort index”
“Sending data”

See screenshot below (from SQL command 'SHOW FULL PROCESSLIST')

Sometimes, MariaDB will remain “stuck” in these states literally forever. The queries will never time out. Even if you use SQL to “KILL” them, they will not disappear. Your database tables freeze. You cannot interact with them anymore. You must manually end the MariaDB process at the OS level.

It’s awful when this happens. I’ve spend a ton of time reading about these 2 states. So far, I’ve found no way of teaching MariaDB to gracefully end them. Nothing I’ve tried makes MariaDB unfreeze.

After dealing with this for 2+ years, my only plan is to get rid of MariaDB. And start using PostgreSQL instead. I’ve worked with other database software for over a decade (e.g. MS SQL). I’ve never had as many problems then, as I do today with MariaDB.

Problem #4 - ERPNext Design Decisions

Some of the ERPNext designs and architecture are adding to the problem. Things like:

The way ‘Stock Ledger Entry’ and ‘GL Entry’ were designed.
The number of hits that SQL table tabSeries experiencing to issue new Naming Series values.
Child and Parent table architecture.
A few bugs in Frappe Framework around Document deletions.
Certain stock and ledger code doing a lot of Looping.
SQL queries executing without the User’s consent/intent (many Reports)

Finally, sometimes MariaDB also just chooses the wrong index, such as when users perform List Page filters.

There are no quick fixes to the above. To fully solve these problems, you’d have to fork your ERPNext, and begin maintaining your own version with your own custom solutions.

Summary

Hopefully some of this information was helpful. If nothing else, know this: You Are Not Alone. These problems are happening for other people too. The larger your database? The more transactions you try to process each day? The more-likely these problems will appear.

sinhpntech · August 27, 2023, 4:19am

Thank you very much, @brian_pond.

Seems like the issue with MariaDB is quite significant here. I have tried integrating my site with PostgresDB, but there might be some apps that haven’t been updated for use with Postgres. Sometimes errors still occur with queries using Postgres.

Perhaps we need to wait for major changes from Frappe and ERPNext, as the interdependencies here are quite substantial. I’ve only used ERPNext and HRMS with Frappe. I really hope ERPNext will update soon to fix these issues. I see many others facing similar problems, and I’m trying to find the simplest and quickest way to resolve it. We’re trying to transition from Odoo to ERPNext because ERPNext’s business process flow aligns well with our team’s needs.

Are you currently using ERPNext for your system, @brian_pond ? Have you had to make many modifications to the Python code of Frappe or ERPNext?

brian_pond · August 27, 2023, 8:37am

The best advice I can give is this:

Do your best to try to solve these problems yourself.
Make friends with other developers, and work together to solve them.
Hire some help.

I wouldn’t suggest waiting and hoping something changes. You may wait a long time, and ultimately end up disappointed. I’ve been working with Frappe framework and ERPNext for over 6 years. Yes, in that time some things have changed. But others things have never changed. What we want, versus what the official maintainers want, are often very different things.

Hope isn’t a reliable strategy. If you really want something done, you’ll need to work on it.

Technically, I am using ERPNext for my own, personal accounting. But that isn’t where most of my work happens. My day job is a consultant & developer who helps other companies with their ERPNext.

I have made many modifications to the Python code of both Frappe and ERPNext. I’ve also written my own Apps, and created some Apps for my clients. The past few years, probably 85%+ of my job is software development.

If you’re looking for more info/inspiration…

The amazing @gavindsouza maintains a GitHub repo named “Awesome Frappe” that contains links to tons of 3rd party creations.
The fantastic @TurkerTunali wrote a nice blog about ERPNext Essentials.
There’s also an official web page about Performance Tuning here.

sinhpntech · August 27, 2023, 8:58am

Thank you very much, @brian_pond ,

There’s been a lot of helpful information for me. I will search for ways to resolve the issue I’m facing and reach out to someone for assistance if possible. It would be great if you know anyone who has encountered the same problem and needs like mine and has managed to fix this issue. Please let me know. I will get in touch with them to learn more.

Thank you very much once again.

szufisher · August 27, 2023, 9:22am

Haha, me too, I also developed some customized apps which mostly used by local Chinese users, these APPs are available here https://gitee.com/yuzelin.

Alaa · December 16, 2023, 4:39pm

Thank you, @brian_pond for your comprehensive and insightful analysis of the concurrency and deadlock issues in ERPNext, particularly in version 13. Your experiences and the strategies you’ve employed provide valuable context for understanding the intricacies of these problems.

I completely relate to your struggles, as I’ve been facing similar challenges with deadlocks in our mobile app, which enables users to place online orders. The simultaneous order placement by multiple users has frequently led to deadlock situations. Like you, I have invested numerous hours in addressing these issues, but the journey has been difficult.

Your point about ERPNext’s suboptimal optimization for concurrency and high load scenarios resonates deeply with our experiences. We’ve observed that ERPNext, in its current state, seems not to be fully equipped to handle heavy concurrent transactions smoothly, which is crucial for applications like ours.

While our team is still investigating and trying to mitigate these issues, our lack of a highly skilled database developer has been a significant barrier. This limitation has led us to explore alternative methods to alleviate the load on the database. We are currently experimenting with background jobs and other techniques to offload some of the transactional processing, hoping to reduce the frequency and impact of deadlocks.

In light of your experiences and recommendations, I am also considering the feasibility of transitioning from MariaDB to PostgreSQL. This switch might present its own set of challenges but could potentially offer a more stable environment for handling complex transactions.

It’s clear that addressing these challenges goes beyond simple fixes and may require significant modifications to ERPNext’s core. As you suggested, forking ERPNext and maintaining a customized version could be a path worth considering for more control and optimization tailored to our specific needs.

In summary, your detailed analysis has not only validated our experiences but also opened up new perspectives on tackling these complex issues. It’s somewhat comforting to know that we are not alone in facing these challenges. The community’s collective insights and efforts can hopefully lead to more robust solutions in the future.

Thank you once again for sharing your valuable experiences and strategies.

Regards,
Alaa

brian_pond · December 16, 2023, 8:23pm

Thank you for the kind words @Alaa. You are not alone. These types of problems are definitely happening for other ERPNext users. Especially for people like us who are doing e-Commerce.

A few more things to share with you:

Since my last thread above, I split a single-device ERPNext into 2 devices:
- Host 1: ERPNext web server, Nginx, Redis databases, Background Jobs.
- Host 2: MariaDB only.
This will not improve SQL deadlocks. But it seems to have reduced the frequency of “Creating sort index” and “Sending data” database locks.

I disabled the ERPNext backups (‘bench backup’, located in the cron jobs), and replaced with MariaDB Physical backups.
- There are 2 kinds of backups with MariaDB: Logical and Physical. The 'bench backup' command creates a Logical backup.
- Logical backups are convenient, but very stressful on large databases. They can consume up to 3 CPUs simultaneously(one for querying SQL, one for writing to disk, one for compression). Logical backups also take a long time to finish.
MariaDB Physical backups (created with tool Mariabackup) are less-stressful on the host device. However, they do have many other problems.
1. Not as easy to perform backup or restores, when compared to Logical backups.
2. Very difficult to backup and restore a single database only.
3. You are responsible for building your own backup retention and rotation scripts.
4. Incremental backups are supported, but complicated to configure.

Despite the new difficulties and challenges, moving to Physical Backups has reduced the load on our database and its host, which helps the overall ERP environment’s performance.

Table Cleanup and Archiving:

Every day, I have a Background Task that deletes older rows from SQL tables. Examples include:
- __global_search
- tabCommunication Link
- tabScheduled Job Log
- tabVersion
- tabDeleted Document
- tabAccess Log
- Orphaned records in tables like 'tabDynamic Link', and 'tabEmail Queue Recipient'
I’ve also started archiving historic SQL records to a 2nd MariaDB database. For example, older Delivery Notes.
I have also archived historic 'Stock Ledger Entry' records. It was difficult, and required new Python code. But it made a big difference in performance.

Next month, I hope to perform an ERPNext v15 upgrade. I’m running forked versions of Frappe and ERPNext, so upgrades take a bit longer. But it’s worth it, imo.

Once the upgrade is completed, I’ll begin the switch to PostgreSQL. I suspect it will have some challenges and problems too. And it won’t solve all the concurrency problems in ERPNext. But I’m hopeful that it’s an improvement over MariaDB. I’ll find out soon…

szufisher · December 17, 2023, 6:21am

waiting for your sharing.

Alaa · December 17, 2023, 6:34am

Hi @brian_pond,

Thanks for sharing your recent strategies and experiences.

I strongly recommend upgrading to newer versions of ERPNext when you can. From V14 onwards, ERPNext has got significant improvements in optimization. For instance, the way GL Entry creation and posting are handled has been greatly enhanced. The overall process is now more efficient.

Regarding the ERPNext backups, you can actually disable them directly from the System Settings. Just set the backup quantity to 0, and it should stop the automatic ‘bench backup’ process.

From V14 onwards, ERPNext introduced the ‘Log Settings’ doctype. This feature allows you to select certain doctypes to be automatically cleared daily, helping with database size and performance. However, the doctypes available for this setting are pre-selected.

Another useful feature from V14 is the ‘Database Storage Usage By Tables’ report. It gives valuable insights into how your database storage is being used.

You might also want to look into disabling the Track Views/Track Changes features for certain doctypes. This can be done through the Customize Form or directly via property setters, for doctypes where customization options are limited such as User doctype.

Our team has found that writing direct SQL queries, instead of relying solely on Frappe ORM, offers more control and efficiency. This approach lets you use features like internal locking (more info here: https://dev.mysql.com/doc/refman/8.0/en/internal-locking.html).

To handle Global Search, consider disabling it in the site_config.json file.

We’ve also implemented a read replica of MariaDB, which is natively supported by Frappe. While this has been effective in certain aspects, we’ve noticed that it has a limited impact when it comes to API performance.

I agree with your approach to archiving. We’ve been archiving tables like sold Serial Numbers. However, I’m curious about your experience with archiving Stock Ledger Entries and GL Entries. We’re considering adopting a similar strategy, but there’s a concern on our end: if we delete old records, how might this impact the accuracy of opening balances in reports that begin after the deletion date?

Again, upgrading to newer versions like V15 could offer more tools and efficiencies for these challenges.

Looking forward to hearing more about your journey, especially your switch to PostgreSQL.

Best,
Alaa

szufisher · December 17, 2023, 12:02pm

this is the new feature which address your concern when archiving old GL Entries.

brian_pond · December 17, 2023, 7:54pm

To archive Stock Ledger Entry, here is what I’m doing:

Created a custom Python function.
- First argument = Item code.
- Second argument = Warehouse name.
Pseudocode for this function:
- Record the current DateTime.
- Calculate or find the current Stock Balance for this Item + Warehouse.
- Archive all Stock Ledger Entry records, where the posting date is less than DateTime. To archive, I’m inserting SQL rows into a 2nd database, and deleting rows from main database.
- Create a brand-new Stock Reconciliation document, purpose = “Opening Stock”
- Add lines for your Item, Warehouse, Quantity, and Value.
- Submit the Stock Reconciliation document.

At the 1 client where I’ve implemented this, it achieved the desired goal: removing SQL rows from Stock Ledger Entry, but maintaining the same quantity and balance.

Note 1: Use at your own Risk. This is a very simple archival. I’m not doing anything with Batches/Serials. I’m also not worrying about database integrity and creating orphaned Links (other Documents that “point” to Stock Ledger Entry rows that are now missing). Other ERPNext clients may need to adjust this, or find a completely different approach.

Note 2: Very important that this function is not run while ‘Stock Ledger Entry’ is being written to. If your ERP has usage 24/7, you may need to schedule a maintenance window where Users cannot login to ERPNext.

Very glad to hear this was added to ERPNext. I’ve been doing this with manual SQL.

Both of these features are important for my clients. So I have to keep them enabled (for important DocTypes). But I also don’t need history older than N days/months. So I’m just deleting records when they become stale.

Same here! I’m using mine as a staging area for a Data Warehouse.

brian_pond · December 17, 2023, 7:54pm

I agree that upgrades can be helpful/important.

For my clients, an upgrade is a large project. It must be discussed with leadership, and the benefits evaluated versus the benefit of other, potential projects. New ERPNext performance and features are cool. But perhaps not as beneficial to the company, when compared to spending Developer time on something else.

Once an upgrade is approved, we schedule it on the business calendar. We need not just developer’s time to perform the Data + Code upgrade. But also several employees’ time for performing Regression Testing. Every ERP feature (and integration) must be thoroughly tested. We need confidence the new Version didn’t not break any features, workflows, employee SOPs, custom Reports, etc.

Once everything is perfect, we’ll schedule a time for changing the Production environment. Afterwards, there is employee support phase. There are always bugs that were missed. New features that behave differently or are confusing. People may need assistance with the new UI changes (seems like the UI always changes since v11)

So, while the benefits of upgrading are clear to us, upgrade projects are kind of a big deal. They usually only happen every 2-3 years.

avc · December 17, 2023, 8:18pm

Hi:

Maybe some interesting stuff about DB optimization here.

github.com/frappe/press

Semi-auto DB Optimizer

opened 01:20PM - 14 Dec 23 UTC

ankush

First cut using slow query log. Next iterations can use perf schema maybe :shrug…: - [x] Analyze slow queries and find top problematic queries - https://github.com/frappe/press/pull/1261 - [ ] lower slow query threshold to somewhere around 0.5 second so we don't need to worry much about missing slow but repetitive queries. Depends on https://github.com/frappe/press/issues/984 - [x] Add FW command to generate table stats - https://github.com/frappe/frappe/pull/23813 - [x] Fetch table stats using using agent - https://github.com/frappe/agent/pull/71 - [x] Analyze a single query for building indexing suggestion - https://github.com/frappe/press/pull/1266 - [x] Fetch data using agent and apply on analyzer #1270 - [ ] Use EXPLAIN plans - [ ] One-click -> add index on live site - [x] Framework - command to add and persist custom index https://github.com/frappe/frappe/pull/23787 - [ ] Build a UI for this - Only after D2 (?) - [ ] List/report view of all slow queries - [ ] Form for analyzing single query with "analyze" - [ ] Query, explain plans - [ ] Suggested indexes, indexes in use - [ ] refetch explain TO check: - [ ] agent timeouts for large table stats job (should be a job?)

Gembira_IT_Tech · August 15, 2024, 12:03pm

Looks like the issue resolve on mariadb 10.6.18.

https://mariadb.com/kb/en/mariadb-10-6-18-release-notes/