Automatic OCR -> doctype fields for PDF invoices/shipping-docs etc

raveslave · September 30, 2023, 3:05pm

Curious if anyone has set up a nice ‘semi-automated’ workflow to populate forms from PDFs received from suppliers.

Not sure how people are doing it, but manually keying in data from a supplier delivery note or invoice is time-consuming and better done by ML-based OCR. By ‘semi’, I mean that the result should be visualised as source data → interpreted field → destination, queued for manual approval.
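
To make the "source data → interpreted field → destination, queued for manual approval" idea concrete, here is a minimal sketch of what one queued item could look like. The class name, attribute names, and sample values are all hypothetical, chosen only for illustration; nothing here is an existing ERPNext or Frappe structure.

```python
from dataclasses import dataclass


@dataclass
class FieldProposal:
    """One OCR result waiting for a human decision (hypothetical structure)."""
    source_text: str        # raw text as it was read from the PDF
    value: str              # interpreted value proposed for the form
    target_field: str       # destination field on the target doctype
    approved: bool = False  # stays False until a reviewer accepts it


def approved_values(proposals):
    """Collect only the reviewer-approved values, keyed by destination field."""
    return {p.target_field: p.value for p in proposals if p.approved}


# Illustrative queue: one approved proposal, one still pending review.
queue = [
    FieldProposal("Invoice No: INV-001", "INV-001", "bill_no", approved=True),
    FieldProposal("Total 1.200,00", "1200.00", "grand_total"),
]
```

Only the approved entries would be written to the document, so a wrong interpretation never reaches the form without someone having looked at it.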

Lucas · October 1, 2023, 5:06pm

If you already have an OCR and metadata extraction tool, post the data to ERPNext via the API.
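
As a rough sketch of that approach: the Frappe REST API creates a document with a POST to /api/resource/<doctype>, authenticated with an "Authorization: token <api_key>:<api_secret>" header. The site URL, credentials, function name, and payload below are placeholders, and a real Purchase Invoice needs more fields (items, for example) than shown here. The function only builds the request, so nothing is sent until you pass it to urllib.request.urlopen.

```python
import json
import urllib.parse
import urllib.request


def build_create_request(base_url, doctype, fields, api_key, api_secret):
    """Build (but do not send) a POST request that would create one document."""
    # Doctype names can contain spaces, so they must be URL-encoded.
    url = f"{base_url.rstrip('/')}/api/resource/{urllib.parse.quote(doctype)}"
    return urllib.request.Request(
        url,
        data=json.dumps(fields).encode("utf-8"),
        headers={
            "Authorization": f"token {api_key}:{api_secret}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder site, credentials, and fields, for illustration only.
req = build_create_request(
    "https://erp.example.com",
    "Purchase Invoice",
    {"supplier": "Example Supplier", "bill_no": "INV-001"},
    "my_api_key",
    "my_api_secret",
)
# To actually send it: urllib.request.urlopen(req)
```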

sternbj · October 6, 2023, 2:36pm

I would also be very interested in it. There must be something. Nothing out there, not even a paid version?

raveslave · October 10, 2023, 9:42pm

+1, I really would like to see a smart PDF extractor that previews how it would populate any doctype, allowing touch-ups and allowing the training set to improve over time.

There are so many great tools to do this now, so it would be a nice addition to have a native way of doing what a human would do when receiving an invoice or delivery note via email.

goodhawk · October 11, 2023, 1:07am

I can share my actual use case here:

In China, invoices are formal PDF documents, and the government is gradually transitioning to standardized XML formats. The scenario is as follows:

Suppose you receive 2 PDF invoices. Here is how I handle OCR processing in ERPNext; the general workflow is as follows:

1. Employees receive 2 PDF format invoices.
2. Employees send the invoices to a designated email address.
3. The system reads the email data, including the email content and attached invoices, and stores them.
4. The system uses OCR to recognize the invoice content and stores it.
5. Finally, the finance team reviews the invoices and generates vouchers in ERPNext.

Technologies involved:

OCR recognition: Python packages like email and pdfminer (another package, cnocr, could also be considered).
Frappe-related technologies: Doctype, scheduling email parsing every 5 minutes.
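
A minimal sketch of the email and PDF steps, under some assumptions: the raw message bytes have already been fetched from the mailbox, the function names are made up for illustration, and pdfminer here means the pdfminer.six package. Note that pdfminer reads a PDF's embedded text layer, so a scanned image-only invoice would still need an OCR engine such as cnocr. The scheduler_events dict shows how a Frappe app's hooks.py can register a job on a 5-minute cron; the dotted path in it is hypothetical.

```python
import io
from email import message_from_bytes, policy
from email.message import EmailMessage


def pdf_attachments(raw_email: bytes):
    """Return (filename, pdf_bytes) for every PDF attached to a raw email."""
    msg = message_from_bytes(raw_email, policy=policy.default)
    return [
        (part.get_filename(), part.get_content())
        for part in msg.iter_attachments()
        if part.get_content_type() == "application/pdf"
    ]


def pdf_text(pdf_bytes: bytes) -> str:
    """Extract the embedded text layer of a PDF (requires pdfminer.six)."""
    # Imported lazily so the email parsing above works without pdfminer.
    from pdfminer.high_level import extract_text
    return extract_text(io.BytesIO(pdf_bytes))


# hooks.py fragment: run a (hypothetical) task every 5 minutes.
scheduler_events = {
    "cron": {
        "*/5 * * * *": ["my_app.tasks.fetch_invoice_emails"],
    }
}

# Build a sample message in memory to show the attachment extraction.
sample = EmailMessage()
sample["Subject"] = "Invoices"
sample.set_content("Please find the invoice attached.")
sample.add_attachment(
    b"%PDF-1.4 placeholder",
    maintype="application",
    subtype="pdf",
    filename="invoice.pdf",
)
found = pdf_attachments(sample.as_bytes())
```

The extracted text could then be stored in a custom doctype for the finance team to review, as in the workflow above.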

raveslave · May 4, 2024, 3:20pm

Waking up this thread. It feels like there are more and more “private GPT” alternatives out there that might be a good way to do smart OCR without the need for a third-party cloud service.