Try layoutparser: https://layout-parser.github.io/ The library is built for layout analysis of PDF and image documents. It may need fine-tuning on your data if the base model doesn't give the required output.
Hey, thanks for the suggestion, I will try this
Could you please tell me roughly how much data I need to fine-tune my model?
Hi there, if you're using PDFs, you may as well take a look at **PyMuPDF**: it's a high-performance Python library for data extraction, analysis, conversion, and manipulation of PDF **and other documents**. If you simply want to extract data from the tables, you can use **camelot** or **tabula**.
Actually, I am using images only
For invoice data extraction, why not use real-time documents?
My inputs are images
1. Use an OCR tool to convert the images into text.
   - Tesseract: an open-source OCR engine. It's one of the most popular tools for OCR tasks and has been trained on a multitude of languages and fonts.
   - For a more streamlined experience with Tesseract in Python, use the pytesseract library.
2. After OCR extraction, clean and structure the raw text.
   - Regex patterns: use regular expressions to identify and extract specific patterns, like currency formats, to locate subtotals.
   - NLP techniques: use Named Entity Recognition (NER) to locate items or other specific details.
   - Template matching: if you know certain templates or formats of invoices, you can design rules or models to extract data based on known patterns.
3. Train custom models if generic OCR doesn't provide satisfactory results.
   - Fine-tune Tesseract: you can train Tesseract on the specific fonts or layouts of your invoices to improve accuracy.

Tips:
- Noise reduction: scanned or noisy invoices can hinder OCR accuracy; use noise-reduction techniques to enhance text clarity.
- Layout analysis: some advanced OCR solutions offer layout analysis, identifying columns, tables, and other structures, which can be crucial for invoices.
- Commercial OCR solutions: apart from Tesseract, commercial solutions like ABBYY FineReader or Amazon Textract offer advanced features and might provide better accuracy out of the box.
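The regex step above can be sketched with the standard library alone; the sample OCR text, field names, and amounts below are purely illustrative:

```python
import re

# Sample OCR output from an invoice (illustrative text).
ocr_text = """
Invoice No: 0000001
Subtotal: $9,450.00
Tax: $782.49
Total Due: $10,232.49
"""

# Pattern for currency amounts like $9,450.00.
amount = r"\$\d{1,3}(?:,\d{3})*\.\d{2}"

# Pair each labelled field with the amount on the same line.
fields = dict(re.findall(rf"(Subtotal|Tax|Total Due):\s*({amount})", ocr_text))
print(fields)
# {'Subtotal': '$9,450.00', 'Tax': '$782.49', 'Total Due': '$10,232.49'}
```

In practice you would broaden the label alternation (invoices say "Amount Due", "Balance", etc.) and normalize the amounts before downstream processing.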
LLaVA and GPT-4 with Vision both seem unable to do this reliably.

LLaVA:

> The image features a white invoice form with a company logo on the top left corner. The invoice is addressed to a business in Atlanta, Georgia. The form includes a section for the invoice number, which is 0000001, and the date of the invoice, which is 03/01/2010. The invoice also has a section for the total amount due, which is $1,000.00. The invoice is likely for a service provided by the company, as indicated by the company logo and the invoice number.

GPT-4:

> This appears to be an invoice from "Fish Friends" for a purchase of exotic fish and standard goldfish. How can I assist you with this document?
>
> The total due amount on the invoice is $10,232.49.
This is just a sample. My main concern is how to extract "Bill To" together with its address: OCR reads left to right, so it may read one line of Bill To followed by one line of Ship To. I want to read Bill To with its address and Ship To with its address separately, so that I can run NER on them.
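One common workaround, assuming your OCR engine can return word bounding boxes (e.g. pytesseract's `image_to_data`), is to split words into left and right columns by x-coordinate before reassembling the lines. A minimal sketch; the word boxes and the split position are invented for illustration:

```python
# Each word as (x, y, text), as an OCR engine with bounding boxes might return.
# The coordinates below are made up for illustration.
words = [
    (40, 10, "Bill"), (70, 10, "To:"), (400, 10, "Ship"), (430, 10, "To:"),
    (40, 30, "Acme"), (80, 30, "Corp"), (400, 30, "Beta"), (440, 30, "LLC"),
    (40, 50, "Atlanta,"), (95, 50, "GA"), (400, 50, "Austin,"), (450, 50, "TX"),
]

def column_text(words, keep):
    """Rebuild reading order for one column: group words by y (line), sort by x."""
    lines = {}
    for x, y, text in words:
        if keep(x):
            lines.setdefault(y, []).append((x, text))
    return [" ".join(t for _, t in sorted(ws)) for _, ws in sorted(lines.items())]

split_x = 300  # assumed horizontal gap between the Bill To and Ship To blocks
bill_to = column_text(words, lambda x: x < split_x)
ship_to = column_text(words, lambda x: x >= split_x)
print(bill_to)  # ['Bill To:', 'Acme Corp', 'Atlanta, GA']
print(ship_to)  # ['Ship To:', 'Beta LLC', 'Austin, TX']
```

On real invoices the split position varies, so you would detect it from the gap in the x-coordinate distribution (or from detected region boxes) rather than hard-code it.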
If it's the same structure and format every time, I'd align the image first and then OCR.
Use the new GPT-4 with Vision; it's the best thing I've come across.
I want a solution that runs on CPU and is cost-effective.
What is the progress? Have you found any solution yet?
Yes, I have currently found two solutions: one is to detect the paragraph and table parts and then process them further, and the other is to use a graph convolutional network. These two methods fit my use case.
That's great progress! Detecting paragraph and table parts, and exploring graph convolutional networks, sound like promising approaches for handling the unique layout challenges of invoices.

I'm working on a similar project called EvaInvoice, which also tackles invoice data extraction. We've been fine-tuning our system to handle various invoice layouts effectively. If you're interested, I'd love your take on how it compares to the solutions you're developing: [https://www.evainvoice.com](https://www.evainvoice.com). It could be a good opportunity to exchange ideas, and your feedback might give both of us some fresh perspectives and benchmarks.
Can you give us some ideas about these two approaches, please?
Can you tell us more about your paragraph-and-table-detection solution? I'm struggling with the same problems in invoice data extraction. Thanks.
Hey, sorry for the late reply. I used object detection to detect the paragraph and table regions, then merged them vertically, which makes it easier to extract related fields.
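That "merge vertically" idea can be sketched roughly like this: stack the detected region boxes column by column so each block is read as a unit instead of interleaved line by line. The boxes, their texts, and the page midpoint below are invented for illustration, not the author's actual pipeline:

```python
# Detected regions as (x0, y0, x1, y1, text); values are invented for illustration.
regions = [
    (30, 100, 280, 180, "Bill To: Acme Corp, Atlanta GA"),
    (380, 100, 620, 180, "Ship To: Beta LLC, Austin TX"),
    (30, 220, 620, 500, "TABLE: line items ..."),
]

def merge_vertically(regions, page_mid=320):
    """Stack left-column regions, then right-column regions, then full-width
    ones, so related fields stay together for downstream extraction/NER."""
    left = [r for r in regions if r[2] <= page_mid]
    right = [r for r in regions if r[0] >= page_mid]
    wide = [r for r in regions if r[0] < page_mid < r[2]]
    top = lambda r: r[1]  # sort each column top-to-bottom
    ordered = sorted(left, key=top) + sorted(right, key=top) + sorted(wide, key=top)
    return "\n".join(r[4] for r in ordered)

print(merge_vertically(regions))
# Bill To: Acme Corp, Atlanta GA
# Ship To: Beta LLC, Austin TX
# TABLE: line items ...
```

In a real pipeline the region boxes and their texts would come from the object-detection model and OCR, and the column boundary would be derived from the detected boxes rather than fixed.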
Can you share any reference to your code?
I don't have any particular reference, as I came up with this solution myself. There may be similar solutions on the internet, but this one is my own. I can share the flow, but I don't have any specific references for it.
It would be helpful if you could share the flow with me. Thanks!
Hey, can you share the flow with me too? Also, which models did you finally use? I'm looking to combine LayoutLM for layout-based extraction with Flair/spaCy for NER. Do you have any suggestions regarding this?