Deadlock issue on Frappe v14

Hello,

I have encountered issues with Deadlocks and Lock Wait Timeout Exceeded in ERPNext.

We have multiple systems operating in the production environment, such as POS, Sales, and more. We are building an intermediate system to synchronize data to ERPNext, with the goal of eventually transitioning the entire system to utilize ERPNext. However, our system is complex and consists of multiple components, making it necessary to transition one part at a time. Thus, we have a service in place to synchronize data from our existing system to ERPNext, which runs at 3 am every day.

We are currently facing problems related to Deadlocks and Lock Timeout errors. Our synchronization service retrieves data from the running system and creates documents in ERPNext. Specifically, we are synchronizing data related to Stock Entry. Sales data from our POS, Sales, and other systems are synchronized to ERPNext. Everything works fine until this synchronization service is executed in the production environment. Each day, we have around 2-3 thousand orders being processed. When the synchronization service processes around 200-300 documents, Deadlock or Lock Timeout errors start occurring. We have tried restarting the bench, but the same errors persist upon reattempting the process.

Currently, we are using the Rest API to create documents. We have attempted to use the api/resource with the frappe.client.insert_many command to create multiple documents, and weā€™ve also tried the api/resource/Stock%20Entry to create individual documents, but the same errors occur.

What can be done to improve and resolve this situation? Thank you for taking the time to assist me. It would be greatly appreciated if we could address and rectify this issue.

This is our Frappe ERPNext versions:

#### Installed Apps

**ERPNext:** v14.36.0 (version-14)
**Frappe Framework:** v14.46.0 (version-14)
**Frappe Helper:** v0.3.3 (master)
**Frappe HR:** v14.10.2 (version-14)
**Restaurant:** v1.3.7 (master)

if possible, try to customize form, Stock Entry, uncheck Track Changes.

also please override the below method

refer to this app https://gitee.com/yuzelin/zelin_am/blob/master/zelin_am/overrides.py

class CustomStockEntry(StockEntry):
def validate_qty(self):
pass

here the similar case https://gitee.com/yuzelin/erpnext-chinese-docs/issues/I6SCJQ, unfortunately in Chinese, but the code and setting changes in English, maybe you can use online translation tools.

1 Like

Thank you so much @szufisher

I see that you are skip validate qty. Can you please explain this solution? How can It works?

Thank you again.

the validate qty function does nothing useful, so in the app, just simply override the function with a pass statement.

1 Like

Thank you so much. Letā€™s me try it

On version 13, I have experienced a -ton- of problems with SQL Locks, Deadlocks, Timeouts. Iā€™ve probably spent a few hundred hours working on these topics.

This is extremely difficult work. I will try to offer some advice, but itā€™s hard to explain without writing a lot more.

Problem #1 - Missing/Bad Indexes

ERPNext is not fully optimized for performance. Sometimes, SQL indexes are missing that should exist. Other times, the Python code is doing inefficient or incorrect logic.

When Locks/Deadlocks/Timeouts happen, try to capture the related SQL queries.

  1. Configure your MariaDB server to use its Slow Query Log.
  2. Try to see whatā€™s happening in realtime, using SQL statements like SHOW FULL PROCESSLIST;

If you can find the queries causing the problems, you can take action. Such as optimizing indexes, or improving the Python code. The SQL command 'EXPLAIN' can help you determine if a query needs index optimization.

Problem #2 - Concurrency can produce Deadlocks

Many ERPNext tasks involve a lot of SQL tables. For example, Submitting a 'Delivery Note or ā€˜Sales Invoiceā€™ can interact with SQL tables (DocTypes) like:

  • Item
  • Customer
  • Sales Order and Sales Order Item
  • Delivery Note and Delivery Note Item
  • Sales Invoice and Sales Invoice Item
  • Sales Taxes and Charges
  • GL Entry
  • Stock Ledger Entry (this table is especially challenging)

If you are Submitting these documents sequentially? Normally that is okay. But what happens if you try to post multiple documents concurrently? (for example, by using background Workers)

What can sometimes happen is a Deadlock. Two or more processes/threads trying to CRUD the same SQL records, at the same time. Neither can complete until the other releases locks. So they deadlock.

Solving this is extremely challenging. It took me a long while to gain control over this:

  1. In many cases I tried to improve the performance by editing Python, SQL, or improving indexes. The faster the process, the less time SQL objects are locked.
  2. If my task did not need a full rollback? Then I would frappe.db.commit() where possible, to remove the locks earlier.
  3. I introduced my own ā€œSafety Lockingā€ mechanism:
    • First, I changed the MariaDB transaction isolation level to ā€œREAD COMMITTEDā€
    • Next, at the beginning of each risky Python function, I manually lock all the records I need.

By manually locking at the beginning, I force all other concurrent transactions to ā€œwaitā€ for locks to release. For example, Invoice #2 may have to wait a few seconds until Invoice #1 releases some locks.

Yes, this reduces concurrency (not ideal). But helps prevent deadlocks (really great). Ever since I introduced this concept into my code, Deadlocks are very, very rare in my environments.

Problem #3 - MariaDB doesnā€™t handle Locks gracefully

When MariaDB encounters long-running queries, or locks, the database tables can easily get ā€œstuckā€ in 2 different states:

  • ā€œCreating sort indexā€
  • ā€œSending dataā€

See screenshot below (from SQL command 'SHOW FULL PROCESSLIST')

Sometimes, MariaDB will remain ā€œstuckā€ in these states literally forever. The queries will never time out. Even if you use SQL to ā€œKILLā€ them, they will not disappear. Your database tables freeze. You cannot interact with them anymore. You must manually end the MariaDB process at the OS level.

Itā€™s awful when this happens. :face_exhaling: Iā€™ve spend a ton of time reading about these 2 states. So far, Iā€™ve found no way of teaching MariaDB to gracefully end them. Nothing Iā€™ve tried makes MariaDB unfreeze.

After dealing with this for 2+ years, my only plan is to get rid of MariaDB. And start using PostgreSQL instead. Iā€™ve worked with other database software for over a decade (e.g. MS SQL). Iā€™ve never had as many problems then, as I do today with MariaDB.

Problem #4 - ERPNext Design Decisions

Some of the ERPNext designs and architecture are adding to the problem. Things like:

  • The way ā€˜Stock Ledger Entryā€™ and ā€˜GL Entryā€™ were designed.
  • The number of hits that SQL table tabSeries experiencing to issue new Naming Series values.
  • Child and Parent table architecture.
  • A few bugs in Frappe Framework around Document deletions.
  • Certain stock and ledger code doing a lot of Looping.
  • SQL queries executing without the Userā€™s consent/intent (many Reports)

Finally, sometimes MariaDB also just chooses the wrong index, such as when users perform List Page filters.

There are no quick fixes to the above. To fully solve these problems, youā€™d have to fork your ERPNext, and begin maintaining your own version with your own custom solutions.

Summary

Hopefully some of this information was helpful. If nothing else, know this: You Are Not Alone. These problems are happening for other people too. The larger your database? The more transactions you try to process each day? The more-likely these problems will appear.

12 Likes

Thank you very much, @brian_pond.

Seems like the issue with MariaDB is quite significant here. I have tried integrating my site with PostgresDB, but there might be some apps that havenā€™t been updated for use with Postgres. Sometimes errors still occur with queries using Postgres.

Perhaps we need to wait for major changes from Frappe and ERPNext, as the interdependencies here are quite substantial. Iā€™ve only used ERPNext and HRMS with Frappe. I really hope ERPNext will update soon to fix these issues. I see many others facing similar problems, and Iā€™m trying to find the simplest and quickest way to resolve it. Weā€™re trying to transition from Odoo to ERPNext because ERPNextā€™s business process flow aligns well with our teamā€™s needs.

Are you currently using ERPNext for your system, @brian_pond ? Have you had to make many modifications to the Python code of Frappe or ERPNext?

The best advice I can give is this:

  • Do your best to try to solve these problems yourself.
  • Make friends with other developers, and work together to solve them.
  • Hire some help.

I wouldnā€™t suggest waiting and hoping something changes. You may wait a long time, and ultimately end up disappointed. Iā€™ve been working with Frappe framework and ERPNext for over 6 years. Yes, in that time some things have changed. But others things have never changed. What we want, versus what the official maintainers want, are often very different things.

Hope isnā€™t a reliable strategy. If you really want something done, youā€™ll need to work on it.

Technically, I am using ERPNext for my own, personal accounting. But that isnā€™t where most of my work happens. My day job is a consultant & developer who helps other companies with their ERPNext.

I have made many modifications to the Python code of both Frappe and ERPNext. Iā€™ve also written my own Apps, and created some Apps for my clients. The past few years, probably 85%+ of my job is software development.

If youā€™re looking for more info/inspirationā€¦

8 Likes

Thank you very much, @brian_pond ,

Thereā€™s been a lot of helpful information for me. I will search for ways to resolve the issue Iā€™m facing and reach out to someone for assistance if possible. It would be great if you know anyone who has encountered the same problem and needs like mine and has managed to fix this issue. Please let me know. I will get in touch with them to learn more.

Thank you very much once again.

Haha, me too, I also developed some customized apps which mostly used by local Chinese users, these APPs are available here https://gitee.com/yuzelin.

3 Likes

Thank you, @brian_pond for your comprehensive and insightful analysis of the concurrency and deadlock issues in ERPNext, particularly in version 13. Your experiences and the strategies youā€™ve employed provide valuable context for understanding the intricacies of these problems.

I completely relate to your struggles, as Iā€™ve been facing similar challenges with deadlocks in our mobile app, which enables users to place online orders. The simultaneous order placement by multiple users has frequently led to deadlock situations. Like you, I have invested numerous hours in addressing these issues, but the journey has been difficult.

Your point about ERPNextā€™s suboptimal optimization for concurrency and high load scenarios resonates deeply with our experiences. Weā€™ve observed that ERPNext, in its current state, seems not to be fully equipped to handle heavy concurrent transactions smoothly, which is crucial for applications like ours.

While our team is still investigating and trying to mitigate these issues, our lack of a highly skilled database developer has been a significant barrier. This limitation has led us to explore alternative methods to alleviate the load on the database. We are currently experimenting with background jobs and other techniques to offload some of the transactional processing, hoping to reduce the frequency and impact of deadlocks.

In light of your experiences and recommendations, I am also considering the feasibility of transitioning from MariaDB to PostgreSQL. This switch might present its own set of challenges but could potentially offer a more stable environment for handling complex transactions.

Itā€™s clear that addressing these challenges goes beyond simple fixes and may require significant modifications to ERPNextā€™s core. As you suggested, forking ERPNext and maintaining a customized version could be a path worth considering for more control and optimization tailored to our specific needs.

In summary, your detailed analysis has not only validated our experiences but also opened up new perspectives on tackling these complex issues. Itā€™s somewhat comforting to know that we are not alone in facing these challenges. The communityā€™s collective insights and efforts can hopefully lead to more robust solutions in the future.

Thank you once again for sharing your valuable experiences and strategies.

Regards,
Alaa

Thank you for the kind words @Alaa. You are not alone. These types of problems are definitely happening for other ERPNext users. Especially for people like us who are doing e-Commerce.

A few more things to share with you:

  1. Since my last thread above, I split a single-device ERPNext into 2 devices:

    • Host 1: ERPNext web server, Nginx, Redis databases, Background Jobs.
    • Host 2: MariaDB only.

    This will not improve SQL deadlocks. But it seems to have reduced the frequency of ā€œCreating sort indexā€ and ā€œSending dataā€ database locks.


  1. I disabled the ERPNext backups (ā€˜bench backupā€™, located in the cron jobs), and replaced with MariaDB Physical backups.

    • There are 2 kinds of backups with MariaDB: Logical and Physical. The 'bench backup' command creates a Logical backup.
    • Logical backups are convenient, but very stressful on large databases. They can consume up to 3 CPUs simultaneously(one for querying SQL, one for writing to disk, one for compression). Logical backups also take a long time to finish.

    MariaDB Physical backups (created with tool Mariabackup) are less-stressful on the host device. However, they do have many other problems. :slightly_frowning_face:

    1. Not as easy to perform backup or restores, when compared to Logical backups.
    2. Very difficult to backup and restore a single database only.
    3. You are responsible for building your own backup retention and rotation scripts.
    4. Incremental backups are supported, but complicated to configure.

Despite the new difficulties and challenges, moving to Physical Backups has reduced the load on our database and its host, which helps the overall ERP environmentā€™s performance.


  1. Table Cleanup and Archiving:
  • Every day, I have a Background Task that deletes older rows from SQL tables. Examples include:

    • __global_search
    • tabCommunication Link
    • tabScheduled Job Log
    • tabVersion
    • tabDeleted Document
    • tabAccess Log
    • Orphaned records in tables like 'tabDynamic Link', and 'tabEmail Queue Recipient'
  • Iā€™ve also started archiving historic SQL records to a 2nd MariaDB database. For example, older Delivery Notes.

  • I have also archived historic 'Stock Ledger Entry' records. It was difficult, and required new Python code. But it made a big difference in performance.


Next month, I hope to perform an ERPNext v15 upgrade. Iā€™m running forked versions of Frappe and ERPNext, so upgrades take a bit longer. But itā€™s worth it, imo.

Once the upgrade is completed, Iā€™ll begin the switch to PostgreSQL. I suspect it will have some challenges and problems too. And it wonā€™t solve all the concurrency problems in ERPNext. But Iā€™m hopeful that itā€™s an improvement over MariaDB. Iā€™ll find out soonā€¦

5 Likes

waiting for your sharing.

Hi @brian_pond,

Thanks for sharing your recent strategies and experiences.

I strongly recommend upgrading to newer versions of ERPNext when you can. From V14 onwards, ERPNext has got significant improvements in optimization. For instance, the way GL Entry creation and posting are handled has been greatly enhanced. The overall process is now more efficient.

Regarding the ERPNext backups, you can actually disable them directly from the System Settings. Just set the backup quantity to 0, and it should stop the automatic ā€˜bench backupā€™ process.

From V14 onwards, ERPNext introduced the ā€˜Log Settingsā€™ doctype. This feature allows you to select certain doctypes to be automatically cleared daily, helping with database size and performance. However, the doctypes available for this setting are pre-selected.

Another useful feature from V14 is the ā€˜Database Storage Usage By Tablesā€™ report. It gives valuable insights into how your database storage is being used.

You might also want to look into disabling the Track Views/Track Changes features for certain doctypes. This can be done through the Customize Form or directly via property setters, for doctypes where customization options are limited such as User doctype.

Our team has found that writing direct SQL queries, instead of relying solely on Frappe ORM, offers more control and efficiency. This approach lets you use features like internal locking (more info here: https://dev.mysql.com/doc/refman/8.0/en/internal-locking.html).

To handle Global Search, consider disabling it in the site_config.json file.

Weā€™ve also implemented a read replica of MariaDB, which is natively supported by Frappe. While this has been effective in certain aspects, weā€™ve noticed that it has a limited impact when it comes to API performance.

I agree with your approach to archiving. Weā€™ve been archiving tables like sold Serial Numbers. However, Iā€™m curious about your experience with archiving Stock Ledger Entries and GL Entries. Weā€™re considering adopting a similar strategy, but thereā€™s a concern on our end: if we delete old records, how might this impact the accuracy of opening balances in reports that begin after the deletion date?

Again, upgrading to newer versions like V15 could offer more tools and efficiencies for these challenges.

Looking forward to hearing more about your journey, especially your switch to PostgreSQL.

Best,
Alaa

2 Likes

this is the new feature which address your concern when archiving old GL Entries.

1 Like

To archive Stock Ledger Entry, here is what Iā€™m doing:

  1. Created a custom Python function.
    • First argument = Item code.
    • Second argument = Warehouse name.
  2. Pseudocode for this function:
    • Record the current DateTime.
    • Calculate or find the current Stock Balance for this Item + Warehouse.
    • Archive all Stock Ledger Entry records, where the posting date is less than DateTime. To archive, Iā€™m inserting SQL rows into a 2nd database, and deleting rows from main database.
    • Create a brand-new Stock Reconciliation document, purpose = ā€œOpening Stockā€
    • Add lines for your Item, Warehouse, Quantity, and Value.
    • Submit the Stock Reconciliation document.

At the 1 client where Iā€™ve implemented this, it achieved the desired goal: removing SQL rows from Stock Ledger Entry, but maintaining the same quantity and balance.

Note 1: Use at your own Risk. This is a very simple archival. Iā€™m not doing anything with Batches/Serials. Iā€™m also not worrying about database integrity and creating orphaned Links (other Documents that ā€œpointā€ to Stock Ledger Entry rows that are now missing). Other ERPNext clients may need to adjust this, or find a completely different approach.

Note 2: Very important that this function is not run while ā€˜Stock Ledger Entryā€™ is being written to. If your ERP has usage 24/7, you may need to schedule a maintenance window where Users cannot login to ERPNext.

Very glad to hear this was added to ERPNext. :+1: Iā€™ve been doing this with manual SQL.

Both of these features are important for my clients. So I have to keep them enabled (for important DocTypes). But I also donā€™t need history older than N days/months. So Iā€™m just deleting records when they become stale.

Same here! :slightly_smiling_face: Iā€™m using mine as a staging area for a Data Warehouse.

2 Likes

I agree that upgrades can be helpful/important.

For my clients, an upgrade is a large project. It must be discussed with leadership, and the benefits evaluated versus the benefit of other, potential projects. New ERPNext performance and features are cool. But perhaps not as beneficial to the company, when compared to spending Developer time on something else.

Once an upgrade is approved, we schedule it on the business calendar. We need not just developerā€™s time to perform the Data + Code upgrade. But also several employeesā€™ time for performing Regression Testing. Every ERP feature (and integration) must be thoroughly tested. We need confidence the new Version didnā€™t not break any features, workflows, employee SOPs, custom Reports, etc.

Once everything is perfect, weā€™ll schedule a time for changing the Production environment. Afterwards, there is employee support phase. There are always bugs that were missed. New features that behave differently or are confusing. People may need assistance with the new UI changes (seems like the UI always changes since v11)

So, while the benefits of upgrading are clear to us, upgrade projects are kind of a big deal. They usually only happen every 2-3 years.

3 Likes

Hi:

Maybe some interesting stuff about DB optimization here.

2 Likes