Has anyone successfully managed to Archive old Data from ERPNext?

I have a system with a database of 5 GB, growing by another 100 MB every 3 weeks. When the database reached 4 GB, everything started slowing down.

I added a bunch of resources:

  • doubled the memory
  • doubled the vCPU count
  • added 50gb to the disk size
  • increased gunicorn workers
  • increased pool sizes
  • added bandwidth

All this made very little improvement in the speed of operation. I believe I have found the tipping point of the Frappe/ERPNext database handling ability.

I need to know if anyone has EVER found a way to archive off old data to a separate database. I want to do this in order to reduce the total size of the database and only keep 2 fiscal years of old data and the current active year in the live system. All other years I want to archive off so they can be referenced on a different server running the same version.

With ERPNext being a viable product now for so many years, I cannot imagine that there is not a way to do this. I know I have not seen it as a feature in the system so there must be some other way to accomplish this.

If anyone can point me to past success in this where I can learn how it was done… that would be awesome.

If anyone has done it and maybe wants to repeat their work for me on a contract basis, send me a PM.

Mostly I would like to find a way to do this so that others in the community can benefit from the experience as well.

~BKM

5 Likes

Archiving historic data on an ongoing basis? Easy.

  • Identify historic records in Database A, based on transaction date.
  • Write those records to Database B.
  • Delete the transactional records in Database A (being safe, and double-checking they exist in Database B right before you delete)
  • Repeat as needed.

This could be done with pure SQL, scheduled in a cron job. Or pure Python. Or even an ERPNext scheduled job. Tons of possibilities. If you mess up? No problem. You just restore records from Database B back into Database A.
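
For illustration, here is a minimal SQL sketch of that copy-then-delete idea for a single table. The schema names live_db and archive_db, the cut-off date, and the choice of table are all hypothetical; a real archive job would also have to cover child tables (e.g. `tabSales Invoice Item`, keyed by `parent`) and every other transactional DocType:

-- Assumes both schemas live on the same MariaDB server (hypothetical names).
-- 1. Copy historic rows into the archive database.
INSERT INTO archive_db.`tabSales Invoice`
SELECT * FROM live_db.`tabSales Invoice`
WHERE posting_date < '2018-04-01';

-- 2. Double-check the rows landed in the archive before deleting anything.
SELECT COUNT(*) FROM archive_db.`tabSales Invoice` WHERE posting_date < '2018-04-01';

-- 3. Only then remove them from the live database.
DELETE FROM live_db.`tabSales Invoice`
WHERE posting_date < '2018-04-01';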

The difficult requirement is your wish to reference the archived years on a different server running the same version: in other words, a 2nd ERPNext. If you didn’t want a 2nd ERPNext? Then everything I wrote above is easily accomplished.

So, how would this work?

To have a 2nd, fully-operational ERPNext: Database B needs a complete set of records from Database A. Not only the historic transactions you archived. But everything else too. Customers, Suppliers, Items, Users, Permissions, Singles, Links. All the tables.

To achieve that? You need an ongoing synchronization event from A to B. However…you cannot use something easy, like Replication. Because archiving records means eventually executing a DELETE statement in Database A. The replication ensures that same DELETE statement would repeat in Database B. And you just lost the records you’re supposed to be archiving. :man_facepalming:

Option 1: Doing what you asked for:

  1. Create a standalone Database C. Empty. Has nothing to do with ERPNext.
  2. Create an ongoing Replication from Database A to Database B. MariaDB Replication - MariaDB Knowledge Base
  3. You now have 2 identical ERPNext running in parallel. Great.
  4. A scheduler starts the Archive Task. All records designated as “archive-able” in Database A are copied into Database C. It saves them forever.
  5. Next, the scheduler deletes the records in Database A. Performance benefit achieved!
  6. Because of SQL replication, Database B is going to automatically delete those records too…
  7. …but luckily we saved them in standalone Database C!
  8. The final step of the Archive Task is to reinsert the records from Database C back into B, leaving Database B with everything that A once had. (A rough SQL sketch of steps 4 through 8 follows below.)
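
A rough SQL sketch of steps 4 through 8, assuming (purely for illustration) that the three schemas are named erpnext_a, erpnext_b, and archive_c and sit on one MariaDB server, and showing only one table. In practice A and B would be separate servers with replication between them, so the data movement would go through mysqldump or a script rather than a single cross-schema statement:

-- Step 4: copy archive-able rows from A into the standalone archive C.
INSERT INTO archive_c.`tabSales Invoice`
SELECT * FROM erpnext_a.`tabSales Invoice`
WHERE posting_date < '2018-04-01';

-- Step 5: delete them from A (replication repeats this DELETE on B).
DELETE FROM erpnext_a.`tabSales Invoice`
WHERE posting_date < '2018-04-01';

-- Step 8: re-insert the saved rows from C into B, so B keeps everything A once had.
INSERT INTO erpnext_b.`tabSales Invoice`
SELECT * FROM archive_c.`tabSales Invoice`
WHERE posting_date < '2018-04-01';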

Option 2: How I’d solve this:

Personally, I’d probably go in a different direction entirely. I’d build a Data Warehouse and do BI reporting.

  • If users want real-time reporting, or recent history, they run traditional ERPNext reports.
  • For anything historic and archived, they’d run Data Warehouse reports.

Building a DW is not trivial. And you’d need some kind of Frontend for the reports. But it’s safe, scales, and avoids the headaches of a 2nd ERPNext, replication pitfalls, DB schema changes when you bench upgrade, timing, etc.

1 Like

Okay… I see your point here. The type of reporting that anyone would want from archived data would likely be summary reports on inventory trends, profit numbers that track seasons of the year, etc. They might even want to see how well a particular new product did historically.

While ERPNext has reports to serve these functions, the entire ERPNext system would NOT be required in order to just get reports out of the data.

So… In keeping with the theme of the original post and then incorporating your new information:

  • Would the system of archiving data (in your “Option 1”) into “database C” essentially be the data warehouse?

  • If the archiving were done only 1 time per year after the closing of the fiscal year, then this process would be relatively non-intrusive to the system in general.

  • Now having this “database C” as an unattached data set with no other system connections, could we use something like one of the SQL report generators (your reference to a “Frontend for reports”) available in the open source wilderness to provide the kind of report tasks that would have otherwise been available in ERPNext?

I get it. The task is not as straight forward as having 2 seemingly identical ERPNext instances with different data. My very rusty knowledge of database functions bit me again. Thank you for the “gentle” re-education. :grin:

So, the reason I opened this up as a combination of Option 1 and Option 2 is that I believe there might be some compromise here that does not create large technical challenges. If there is a way to make the data warehouse concept work without making my head explode, then I would want to do it that way.

I guess I would have to figure out how the “Database C” could be structured to allow for this to work and then come up with the archiving process to move old data there.

I guess I would also have to figure out the “Frontend” reporting package to use, as that might impact how the data gets structured in “Database C”.

It will obviously take more planning than I had originally thought, but I am not in a position to shy away from it now. I have a problem to solve, and this looks like the path to an answer.

Thanks @brian_pond

~BKM

This is a very important and interesting conversation.

Bookmarking this for updates

@brian_pond I know you are not a fan of attempting to keep a bunch of parallel ERPNext servers in sync, but please review this idea and give me your thoughts.
(I have such servers already because of how I run backups.)

Okay… Here is another possible approach.

I am not sure if this would work for everyone, but it might work for me in particular because I maintain multiple, identical servers for each of my clients. I do this in order to make use of my Poor Man’s Backup System v2

In my poor man’s backup, I use server images to replicate the server multiple times under sub-domains, and then send hourly backups of the entire live database from the production server to these servers.

So… Having these backup servers with very complete databases on them, maybe another approach could be done like this:

  • create another server using the client server image
  • port the latest backup to the new server
  • run the OPPOSITE of a data archive and delete all NEW records after a particular historical date (like the fiscal year transition); see the rough SQL sketch after this list.
  • Once all NEW records are gone, make another backup of the now smaller database
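
The “reverse archive” delete might look roughly like this in SQL, run only on the throwaway copy of the backup (never on production). The cut-off date is hypothetical, and the same treatment would be needed for child tables and every other transactional table:

-- Keep only records up to the fiscal year transition; delete everything newer.
DELETE FROM `tabSales Invoice` WHERE posting_date >= '2019-04-01';
DELETE FROM `tabGL Entry` WHERE posting_date >= '2019-04-01';
DELETE FROM `tabStock Ledger Entry` WHERE posting_date >= '2019-04-01';
-- ...and so on for the remaining transactional DocTypes.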

Okay… so why do this you might ask?

This approach would give me a database backup of the old data I was hoping to archive off with the “old data first” method that I started this thread about. By deleting all of the newest records and leaving only the old records in the database that I wanted to archive anyway, I now have a valid (and restorable) database of the archived old data that can be loaded up on any of the backup servers to be used for historical research.

To make this method work one would have to delete all of the records that followed the closing of the first fiscal year of the system data. That would give you year one of your archive.

To get year two, you would do the same process except you delete all of year one old data, then also delete all data following the close of fiscal year two. Now you have a restorable database of the year two data.

Do the same for all subsequent years and you eventually wind up with a restorable database for each archived fiscal year. When you have all of the years archived in the order you want them, the final step would be to take the entire database one last time and delete all of the old data (that now exists as yearly restorable databases).

My best guess is that this should give you a set of archive databases for each fiscal year, plus a smaller live system, because you have moved all of that old data out of your way. With all of this set up, you should then be able to simply run through the first step again at the close of the next fiscal year to create the next archive database in the sequence.

This is all THEORY !!!

It may very well blow up somewhere, but I gotta ask…

Do you think this might work?

Again, this is my current theoretical approach that would work for ME because I already have this multiple parallel server thing going on in order to keep my poor man’s backup system running. If you are using a similar system then maybe it could work for you as well.

I need someone with great database knowledge and skills to let me know if this might work. Any such people in the community here?

This is still an exercise in progress. Have not decided on a good approach yet, but investigating everything.

~BKM

@bkm have you looked at database indexes?

I’m asking because I have had an installation of ERPNext since 2013. Right now my database is 28 GB, and it runs smoothly on a server with 8 GB of memory and a 250 GB hard disk.

It’s a really old installation, on a really old server, and still I can query 300K products in 12 seconds.

When my database started to become slow, I discovered that 75% of the latency was due to poor indexes.

After enabling the slow query log and properly applying indexes on the DB, the performance boosted.
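
For anyone who wants to try the same thing, a minimal way to turn on the slow query log in MariaDB/MySQL (the 2-second threshold is just an example; you can also set these permanently in my.cnf):

SET GLOBAL slow_query_log = 1;
SET GLOBAL long_query_time = 2;
SHOW VARIABLES LIKE 'slow_query_log_file';    -- shows where the log is written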

Right now, of the 12 seconds:

  • 2 seconds is the DB (that’s acceptable)
  • 4 seconds is Python (creating objects in memory; that is acceptable also)
  • 1.5 seconds is my network latency, due to the amount of data
  • JavaScript and HTML rendering is the other 6.5 seconds

4 Likes

I wouldn’t call it a Data Warehouse, no. That particular phrase has special meanings and connotations. That’s worth a whole other thread and conversation. In this context, I’d probably call Database C just an “archive” or “staging” database.

Yes, certainly you could create SQL-based reports against Database C. Just by writing some Joins between its tables (historic transactions) against the complete ERPNext database tables.
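
For example, a minimal sketch of such a report query, assuming (hypothetically) that the archive schema is called archive_c and the live ERPNext schema is called erpnext_live, both on the same MariaDB server:

-- Total archived sales per customer, joining archived transactions to live master data.
SELECT cust.customer_name,
       SUM(inv.grand_total) AS archived_sales
FROM archive_c.`tabSales Invoice` AS inv
JOIN erpnext_live.`tabCustomer` AS cust ON cust.name = inv.customer
WHERE inv.docstatus = 1
GROUP BY cust.customer_name;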

Regarding front-end reporting tools: on my list of candidates, Knowage is in the lead, followed by Pentaho. Those provide a full set of capabilities. If only dashboard widgets and graphs are required, then Cube.js or Grafana are options too.

Edit: I do agree 100% with @max_morais_dmm about the importance of Indexes. They can make a tremendous difference. However, archiving is a good idea for performance reasons, and more.

Yes, this would definitely work.

  1. Because you are only performing the Archive Event once per year.
  2. Because you are setting this expectation: “no -single- ERPNext instance will contain the entire dataset.”
  • Live system only has activity 2021 and greater.
  • Backup #1 only has activity fiscal year 2020.
  • Backup #2 only has activity fiscal year 2019.
  • …and so on.

With your method, each ERPNext can stand alone, and run independently with a complete set of master data.

The disadvantage is no single ERPNext has all the answers. You cannot analyze the entire business, Beginning to Present Date, without exporting reports from each ERPNext instance to Excel. And manually merging them.

Hmm… Good point. So while this plan would work, it may not provide the ability to do data analysis that one would hope to get from archived data.

So, I am back at trying to do some sort of “Database C” solution with a reporting frontend.

Thank you for helping clear the fog. The only real solution is either your original “Option 1”, which now seems pretty labor intensive, or some half step: a simple database to hold the data plus a reporting package.

~BKM

(Hopefully I haven’t completely missed the point here)

This is a problem with at least a half century of history. All modern databases have powerful tools to address it. I see no one here has mentioned “horizontal partitioning”, so I’ll toss this into the discussion …

Horizontal partitioning literally “archives off old data to a separate database”, but links it into your main database so it’s all still available, albeit with different performance.

24.1 Overview of Partitioning in MySQL
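
As a generic illustration of the syntax only (the table and columns below are made up, and this cannot be applied directly to a stock ERPNext table, because MariaDB/MySQL require the partitioning column to be part of every unique key, while ERPNext tables are keyed on `name` alone):

CREATE TABLE archived_invoices (
    name VARCHAR(140) NOT NULL,
    posting_date DATE NOT NULL,
    grand_total DECIMAL(18,6),
    PRIMARY KEY (name, posting_date)
)
PARTITION BY RANGE (YEAR(posting_date)) (
    PARTITION p2018 VALUES LESS THAN (2019),
    PARTITION p2019 VALUES LESS THAN (2020),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

Old data then lives in its own partitions, which can be queried normally, exported, or dropped independently.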

Hi @max_morais_dmm, I have a DB on an 8 GB RAM, 4-core server that is hogging the full memory. Prepared Reports are taking a huge amount of time, and I am not sure how to fix it. Can you share your method for my benefit and others’?

I have very little knowledge of databases. Thanks

I am not sure exactly how @max_morais_dmm did it, but I worked through it with a friend and it went something like this (I assume you are asking about how to set up indexes for overgrown database tables):

First and foremost, run a complete backup of your database so that, in case something goes horribly wrong, you have the ability to get back to where you started. Now we can begin.

First find the directory that contains the slow query log. So, start mysql:

mysql -u root -p[your-password]

SELECT  @@datadir;

This will reveal the directory where you will find the slow query log file. In my case it was:

/var/lib/mysql

When I looked in that directory, near the bottom of the files list I found:

mysql-slow.log

So, now that we know where to look and what the file is named, you stop mysql and delete the file, then restart mysql to re-initialize the file:

sudo service mysql stop
sudo rm /var/lib/mysql/mysql-slow.log
ls -lha /var/lib/mysql                 (to make sure the file was deleted and is no longer there)
sudo service mysql restart
ls -lha /var/lib/mysql                 (to make sure the file was recreated on restart)

Now we have a running system with a fresh and mostly empty slow query log file. So, go run the reports and do some regular ERP functions that you think run slow. After a few exercises, we dump the log file to a text file so it can be easily read:

sudo cat /var/lib/mysql/mysql-slow.log > ~/slowlog.txt

This will create a file called slowlog.txt in your user folder that you can now easily review in any editor, or even transfer down to your local PC and use a local editor to read it.

Now that we have the file, we need to evaluate it a bit looking for the tasks that are taking up the most time. Here is an example from one of my own files:

# Time: 210330  9:11:35
# User@Host: 1bd3e0294da19198[1bd3e0294da19198] @ localhost [127.0.0.1]
# Thread_id: 2316  Schema: 1bd3e0294da19198  QC_hit: No
# Query_time: 10.630705  Lock_time: 0.000096  Rows_sent: 0  Rows_examined: 4756902
# Rows_affected: 0
# Full_scan: Yes  Full_join: No  Tmp_table: No  Tmp_table_on_disk: No
# Filesort: Yes  Filesort_on_disk: No  Merge_passes: 0  Priority_queue: Yes
SET timestamp=1617109895;
select name, owner, creation, data from `tabVersion` where `tabVersion`.docname = "STE-54117" and `tabVersion`.ref_doctype = "Stock Entry"
			  order by creation desc limit 0, 10;

In this example we are looking at the “Query_time” of about 10.6(+) seconds and the operation was submitting a stock entry. So after the “SET timestamp” field we see the actual operation and that is where the rest of the clues come from.

We see that it was an operation on the “tabVersion” table that took so long. If you were to look at the other hundreds of entries in my log file, you would see that about 85% of them are slow operations on the “tabVersion” table, and every one of them was trying to find records by .ref_doctype and then .docname

So the logical thing to do was to create an index on that table covering those two fields. To do that I went into mysql again and created the index for that table:

mysql -u root -p[your_password]

CREATE INDEX test_index1 ON `tabVersion` (`ref_doctype`, `docname`);

exit

This creates an index named test_index1 on the tabVersion table, keyed first by ref_doctype and then by docname within ref_doctype, to speed up those searches.
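
If you want to confirm the index exists and that the slow query will actually use it, a quick check (using the same table and query as above) is:

SHOW INDEX FROM `tabVersion`;

EXPLAIN select name, owner, creation, data from `tabVersion`
    where `tabVersion`.docname = "STE-54117" and `tabVersion`.ref_doctype = "Stock Entry"
    order by creation desc limit 0, 10;

The EXPLAIN output should list test_index1 in its "key" column once the index is in place.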

Now we restart mysql again to make sure everything is ready for testing.

sudo service mysql restart

Now log back into your ERPNext system and run the same tests for things that were running slow before to see if you have a speed improvement.

WARNING - messing with your database is a dangerous thing if you are not sure of yourself. So be warned: make backups before attempting anything and always check your syntax as you go.

Now, I had very little experience with mysql database work, but the instructions I received from a friend were very helpful and I just jumped in and tried it. It worked very well for me.

And as ALWAYS… Your mileage may vary! :sunglasses:

BKM

EDIT - If you want a larger data pool to work with and possibly get better results, instead of only running a few tests of the slow functions, try letting the fresh slow query log file capture a day’s worth of information and then dump it to a text file for evaluation. This will actually give you much more data to tell you how many times a particular table may be interfering with your business during the day.

Sometimes more information is better. But if your system was like mine, the slow query log that already existed was massive (36 MB of text), which is why I suggest deleting it and focusing your testing on a few functions or a day’s worth of work. :nerd_face:

BKM

1 Like

This is an interesting and (potentially) critical topic. We would have to treat it like a fresh installation. The archival process would work based on the fiscal year cut-off (say March 31st, 2015).

a. Create opening balance entries across all accounting / inventory tables.
b. We would need to look at masters and see if there are references that would be impacted. Typically, we can leave masters well enough alone.
c. All open POs/SOs/JOs/Invoices etc would have to be duplicated for next year (in our example FYE-2016) - with a PO date of 2015.
d. The new-look data would be staged on another DB. Some sort of verification step would have to be undertaken: a set of reports that we can compare at the end of FYE 2016, pre- and post-truncation. If the data matches, then we go ahead and truncate the main DB.

It sounds like archiving should be a built-in feature and not various kinds of hacks that are easy to mess up.

The “safest” would probably be to store full instance backups from every year.

3 Likes