Really amazing team and community … many thanks
This is phenomenal, great cooperation and outcome. Looking forward to the sprint before the conference.
Thank you @rmehta for the opportunity to work with you guys. Lots of new things learned. We will write about it in detail. Looking forward to many such collaborative sprints.
@rmehta @Sachin_Mane Awesome work. Can someone write a detailed blog/article describing how this improvement was achieved?
Yes. Will summarise changes
Has this improved just the sales invoice or all doctypes?
~4 times lighter and faster ERPNext, and a more than 4 times happier community
This article summarises the code changes done as part of the optimisation sprint we had in the Frappe office 2 days back…
To start with, following is a short 'Before' and 'After' summary.
Transaction: submitting a Sales Invoice containing 51 items.
Bench Console:

```
In [3]: si = frappe.get_last_doc('Sales Invoice')

In [4]: len(si.items)
Out[4]: 51

In [2]: %prun -s cumulative si.submit()
```
Results

**Before optimisations**
- Time taken: 4.56s
- Number of SQL calls: 3625

**After optimisations**
- Time taken: 1.511s (first call) and 0.981s (subsequent calls)
- Number of SQL calls: 1070
It was evident that within a single transaction, the same document was getting read from the database again and again. The `get_doc` and `get_value` methods were getting called multiple times for the same document and the same field.
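To illustrate the kind of redundancy involved, here is a made-up sketch (not code from the sprint): with 51 item rows, a per-row lookup like this issues a separate query on every iteration, even though the answer cannot change within the request.

```python
import frappe

si = frappe.get_last_doc('Sales Invoice')

# illustrative only: one database query per row, even when many rows
# reference the same Item
for item in si.items:
    item_group = frappe.db.get_value('Item', item.item_code, 'item_group')
```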
The general approach was to cache repeatedly accessed documents in memory to reduce the number of database calls. Following is a list of the key changes.
get_cached_doc instead of get_doc
The first optimisation was to cache documents in a local in-memory request level cache. A new request level cache is added in `frappe/__init__.py`:

```python
def init(site, sites_path=None, new_site=False):
    """Initialize frappe for the current site. Reset thread locals `frappe.local`"""
    ....
    local.document_cache = {}
```
NOTE:
- `frappe.local` is a request level cache and gets cleared before each request.
- `frappe.flags` is also a request level cache.
A new method `frappe.get_cached_doc` is introduced as follows:
```python
def get_cached_doc(*args, **kwargs):
    if args and len(args) > 1 and isinstance(args[1], text_type):
        key = get_document_cache_key(args[0], args[1])

        # local cache
        doc = local.document_cache.get(key)
        if doc:
            return doc

        # redis cache
        doc = cache().hget('document_cache', key)
        if doc:
            doc = get_doc(doc)
            local.document_cache[key] = doc
            return doc

    # database
    doc = get_doc(*args, **kwargs)
    return doc
```
The above method progressively checks the local request cache, then the global Redis cache, and finally falls back to the database to fetch the document.
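The post does not show `get_document_cache_key`; a minimal sketch, assuming the key is simply derived from the doctype and name:

```python
def get_document_cache_key(doctype, name):
    # assumption: the real implementation may build the key differently
    return '{0}::{1}'.format(doctype, name)
```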
The `get_doc` method is also modified to put documents retrieved from the database into the cache as follows:
```python
def get_doc(*args, **kwargs):
    """Return a `frappe.model.document.Document` object of the given type and name."""
    ...

    # set in cache
    if args and len(args) > 1:
        key = get_document_cache_key(args[0], args[1])
        local.document_cache[key] = doc                     # set in request level cache
        cache().hset('document_cache', key, doc.as_dict())  # set in global redis cache
```
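The snippets above do not show invalidation. Presumably both cache layers must be cleared whenever a document is written, along these lines (a hedged sketch; the helper name and details are hypothetical):

```python
def clear_document_cache(doctype, name):
    # hypothetical sketch: drop both layers so a stale copy is never
    # served after the document is updated or deleted
    key = get_document_cache_key(doctype, name)
    local.document_cache.pop(key, None)
    cache().hdel('document_cache', key)
```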
Then selective find-and-replace changes were made based on the `%prun` output analysis. For example, one sample `%prun` record showed a lot of time being spent in the `get_item_details` method:
```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    0.002    0.000    1.809    0.036 get_item_details.py:16(get_item_details)
```
All such time-consuming methods were identified and their `get_doc` calls were replaced with `get_cached_doc`. Following is the `%prun` output after this change - cumulative time for `get_item_details` dropped from 1.809s to 0.529s:
```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       50    0.002    0.000    0.529    0.011 get_item_details.py:19(get_item_details)
```
get_cached_value instead of get_value
Similar to `get_doc`, `get_value` was another method that was getting called many times for the same doctype and field, making it another candidate for caching. The following method is introduced; it internally calls `get_cached_doc` and avoids a database call altogether:
```python
def get_cached_value(doctype, name, fieldname, as_dict=False):
    doc = get_cached_doc(doctype, name)  # gets the cached doc; 1 db call saved per call
    if isinstance(fieldname, text_type):
        if as_dict:
            throw('Cannot make dict for single fieldname')
        return doc.get(fieldname)

    values = [doc.get(f) for f in fieldname]
    if as_dict:
        return _dict(zip(fieldname, values))
    return values
```
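Usage mirrors `get_value`; for example (the item name here is made up):

```python
# single field
rate = frappe.get_cached_value('Item', 'ITEM-0001', 'standard_rate')

# multiple fields, returned as a dict
item = frappe.get_cached_value('Item', 'ITEM-0001',
    ['item_name', 'stock_uom'], as_dict=True)
```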
Then another selective find-and-replace was done after analysing the `%prun` output. The following number of occurrences of `get_value` were replaced:
```
➜  apps grep -R 'get_cached_value' . | grep -v pyc | wc -l
186
```
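A typical replacement looked something like this (illustrative example, not an actual line from the diff):

```python
# before: hits the database on every call
currency = frappe.get_value('Company', self.company, 'default_currency')

# after: served from the request/redis cache
currency = frappe.get_cached_value('Company', self.company, 'default_currency')
```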
Fixed Meta caching
Rushabh found out that doctype metadata was not actually getting saved in the Redis cache. Serialising the doctype metadata was failing and the exception was being silently swallowed. A custom serialisation method was introduced to serialise the metadata as follows:
```python
def as_dict(self, no_nulls=False):
    def serialize(doc):
        out = {}
        for key in doc.__dict__:
            value = doc.__dict__.get(key)
            if isinstance(value, (list, tuple)):
                if len(value) > 0 and hasattr(value[0], '__dict__'):
                    value = [serialize(d) for d in value]
                else:
                    # non standard list object, skip
                    continue
            if (isinstance(value, (frappe.text_type, int, float, datetime, list, tuple))
                    or (not no_nulls and value is None)):
                out[key] = value
        return out

    return serialize(self)
```
After the custom serialisation, Redis started caching it. This was a big improvement, significantly reducing the number of database calls, since the metadata is accessed very frequently.
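To see what `serialize` keeps and drops, consider a toy example (the classes here are illustrative, not Frappe's):

```python
from datetime import datetime

class DocField(object):
    def __init__(self, fieldname, idx):
        self.fieldname = fieldname
        self.idx = idx
        self.parent_doc = object()  # not a plain value: serialize() skips it

class Meta(object):
    def __init__(self):
        self.name = 'Sales Invoice'
        self.modified = datetime.now()
        self.fields = [DocField('customer', 1), DocField('items', 2)]

# applying the inner serialize() to Meta() would yield plain dicts
# that Redis can store:
# {'name': 'Sales Invoice', 'modified': datetime(...),
#  'fields': [{'fieldname': 'customer', 'idx': 1},
#             {'fieldname': 'items', 'idx': 2}]}
```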
Although these changes should improve the performance of most transactions because of the shared codebase, we should `%prun`-analyse every transaction to identify the most time-consuming methods and selectively replace `get_doc()` and `get_value()` with `get_cached_doc()` and `get_cached_value()`.
Great work
I wonder how this will impact recommendations on how much memory to allocate to the redis cache. Right now, even for many concurrent users (100+), a small amount of memory for the redis cache is enough. It looks like that will change with this.
awesome! way to go, guys!
Have you joined the developer group on Telegram? Here is the link:
I have often encountered the fact that I needed to call `get_doc` only to call its instance methods. Often, these instance methods didn't interact with any of the child tables, and these docs were loaded in hooks, hence weren't part of the response.
Also, child tables can often be very large. Lazy loading of child tables could surely be a helpful addition to `get_doc`. Also, lazy loading should be a logical decision instead of a configuration, which could be implemented via an additional argument called "lazy_load_children" passed to `get_doc` like:

```python
frappe.get_doc(doctype, docname, lazy_load_children=True)
```
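A hedged sketch of how such a flag could work (this is not an existing Frappe API; the class and helper names are hypothetical):

```python
import frappe

class LazyDocument(object):
    """Hypothetical: defer fetching child rows until first access."""

    def __init__(self, doctype, name, lazy_load_children=False):
        self.doctype = doctype
        self.name = name
        self._items = None
        if not lazy_load_children:
            self._items = self._fetch_children()

    @property
    def items(self):
        # child rows are fetched on first access only
        if self._items is None:
            self._items = self._fetch_children()
        return self._items

    def _fetch_children(self):
        # assumes a single child table; a real version would consult meta
        return frappe.get_all('Sales Invoice Item',
            filters={'parent': self.name, 'parenttype': self.doctype},
            fields=['*'])
```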
Based on SP, I think we can optimize ERPNext in a better way. The time may be less than 1 sec.
Nice analysis @Sachin_Mane
Good work @Sachin_Mane @rmehta
I also worry about the potential for increased RAM requirements on the host platform. Memory is the single largest contributing factor to server costs. So for a system with 4 or 5 gunicorn workers and 12 live users logged in and processing sales, there could be an increased memory requirement of close to 500MB? Does that sound right?
Memory usually comes in 1GB increments and costs an additional $6 to $10 per month on the server fees invoice.
Have I read this correctly? Is that a reasonable expectation of additional memory overhead?
BKM
If you don’t allocate more memory to redis cache, then that will just mean more cache misses and lead to performance similar to current implementations.
You don’t have to increase redis cache if you don’t want to - everything should still work the same as before (i believe - please correct me if I’m wrong).
The analogy is with the `innodb_buffer_pool_size` MariaDB variable you've most likely modified in other implementations. You can keep the value small, but more memory means better performance. It'll be the same for this. Finding the recommended values will be the key - and everyone's value proposition will be different. But until this is tested and someone reports the results, we won't really know the recommendations.
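For what it's worth, the cap lives in the Redis cache config; an illustrative example (the path and values are assumptions, bench generates its own config):

```
# config/redis_cache.conf (illustrative)
maxmemory 512mb
maxmemory-policy allkeys-lru
```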
Lazy loading is a good idea. However, it may complicate the caching approach. With the document cache enabled, lazy loading won't be needed.
Memory is cheap nowadays. You can choose your server configuration based on your current memory utilisation.
It is a tradeoff and, as @felix mentioned, it is up to you to decide how much memory should be allocated.
`get_cached_doc` and `get_cached_value` won't implement a lazy loading param.
We use `get_cached_doc` for master data which is read multiple times, while lazy loading would be for documents which are read only once and whose child tables don't need to be read. That way, we can still use instance methods (without having to manually create instance objects of document classes) without loading child table values (which can number in the hundreds in some cases).
Makes sense to me.
A design question: currently, when `get_cached_doc` and `get_cached_value` are called with a bad key, they raise a `DoesNotExist` error rather than returning `None`. Is there a reason it works this way? I would think you'd want the two methods to have symmetrical APIs.
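To illustrate the asymmetry being asked about (behaviour as I understand it, sketched rather than verified against the code):

```python
frappe.db.get_value('Item', 'NO-SUCH-ITEM', 'item_group')
# -> returns None

frappe.get_cached_value('Item', 'NO-SUCH-ITEM', 'item_group')
# -> raises frappe.DoesNotExistError
```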