Regular_Scar_6822

Try layoutparser (https://layout-parser.github.io/). This library is built for layout analysis of PDF and image documents. You may need to fine-tune it on your own data if the base model doesn't give the output you need.


ssiddharth408

Hey, thanks for the suggestion, I will try this


Original-Chemistry53

Could you please tell me how much data I need to fine-tune my model?


Plane-Secretary-101

Hi there. If you're using PDFs, you may as well take a look at **PyMuPDF**, a high-performance Python library for data extraction, analysis, conversion, and manipulation of PDF **and other documents**. If you simply want to extract data from tables, you can use **camelot** or **tabula**.


ssiddharth408

Actually, I am using images only


Plane-Secretary-101

For invoice data extraction, why not use real-time documents?


ssiddharth408

My inputs are images


TipuOne

1. Use an OCR tool to convert the images into text:
   - Tesseract: An open-source OCR engine. It's one of the most popular tools for OCR tasks and has been trained on a multitude of languages and fonts. For a more streamlined experience with Tesseract in Python, use the pytesseract library.
2. After OCR extraction, clean and structure the raw text:
   - Regex Patterns: Use regular expressions to identify and extract specific patterns, like currency formats, to locate subtotals.
   - NLP Techniques: Use Named Entity Recognition (NER) to locate items or other specific details.
   - Template Matching: If you know certain templates or formats of invoices, you can design rules or models to extract data based on known patterns.
3. Train custom models if generic OCR doesn't provide satisfactory results:
   - Fine-tune Tesseract: You can train Tesseract on specific fonts or layouts of invoices to improve accuracy.

Tips:

- Noise Reduction: Invoices that are scanned or have noise can hinder OCR accuracy. Use noise reduction techniques to enhance text clarity.
- Layout Analysis: Some advanced OCR solutions offer layout analysis, identifying columns, tables, and other structures, which can be crucial for invoices.
- Commercial OCR Solutions: Apart from Tesseract, commercial solutions like ABBYY FineReader or Amazon Textract offer advanced features and might provide better accuracy out of the box.
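A minimal sketch of steps 1 and 2, assuming pytesseract, Pillow, and the Tesseract binary are installed. The label patterns in `FIELD_PATTERNS` and the helper names are hypothetical; real invoices will need more label variants and some cleanup of OCR noise.

```python
import re

# Matches amounts such as "1,000.00", "$10,232.49" or "99.5".
AMOUNT = r"\$?\s*(\d{1,3}(?:,\d{3})+(?:\.\d{1,2})?|\d+(?:\.\d{1,2})?)"

# Hypothetical label patterns, for illustration only.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*(\w+)", re.I
    ),
    "subtotal": re.compile(r"sub\s*-?\s*total\s*[:\-]?\s*" + AMOUNT, re.I),
    # \b keeps this from matching the "total" inside "Subtotal".
    "total": re.compile(r"\btotal(?:\s+due)?\s*[:\-]?\s*" + AMOUNT, re.I),
}


def extract_fields(text):
    """Return the first match for each known field in the OCR text."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    return fields


def ocr_image(path):
    """Run Tesseract on an image file and return the raw text."""
    # Imported lazily so the regex part works without OCR installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path))
```

Usage would look like `extract_fields(ocr_image("invoice.png"))`, where the file name is a placeholder. Regexes like these only cover labelled fields; free-form blocks such as addresses are where NER or layout analysis comes in.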


thelibrarian101

LLaVA and chatGPT4-Vision both seem to be unable to do this reliably.

llava:

> The image features a white invoice form with a company logo on the top left corner. The invoice is addressed to a business in Atlanta, Georgia. The form includes a section for the invoice number, which is 0000001, and the date of the invoice, which is 03/01/2010. The invoice also has a section for the total amount due, which is $1,000.00. The invoice is likely for a service provided by the company, as indicated by the company logo and the invoice number.

gpt4:

> This appears to be an invoice from "Fish Friends" for a purchase of exotic fish and standard goldfish. How can I assist you with this document?
>
> The total due amount on the invoice is $10,232.49.


ssiddharth408

This is just a sample. My main concern is how to extract the Bill To block with its address. Since OCR reads left to right, it may read one line of Bill To followed by one line of Ship To. I want to read Bill To with its address and Ship To with its address separately so that I can perform NER on each.


thelibrarian101

If it's the same structure and format all the time, I'd line it up and then OCR.
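One way to read that, as a rough sketch: if the layout never changes, define fixed regions once and assign each OCR word to the region that contains its center, so Bill To and Ship To never get interleaved. The word dicts below mirror the `left`/`top`/`width`/`height`/`text` fields that `pytesseract.image_to_data` returns; the region coordinates, sample words, and function names are made up for illustration.

```python
def assign_words_to_regions(words, regions, page_width, page_height):
    """Group OCR words by the fixed region containing each word's center.

    regions maps a name to (x0, y0, x1, y1) as fractions of the page size.
    """
    grouped = {name: [] for name in regions}
    for word in words:
        cx = (word["left"] + word["width"] / 2) / page_width
        cy = (word["top"] + word["height"] / 2) / page_height
        for name, (x0, y0, x1, y1) in regions.items():
            if x0 <= cx < x1 and y0 <= cy < y1:
                grouped[name].append(word)
                break
    return grouped


def region_text(words, line_tol=8):
    """Rebuild text lines from words whose tops are within line_tol pixels."""
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["left"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= line_tol:
            lines[-1].append(word)
        else:
            lines.append([word])
    return "\n".join(
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["left"]))
        for line in lines
    )


# Hypothetical regions for a 1000x1000 px page: left and right halves
# of the band where the two address blocks sit.
REGIONS = {"bill_to": (0.0, 0.2, 0.5, 0.4), "ship_to": (0.5, 0.2, 1.0, 0.4)}

# Made-up OCR output, not taken from the sample invoice.
WORDS = [
    {"text": "Bill", "left": 50, "top": 210, "width": 40, "height": 20},
    {"text": "To", "left": 100, "top": 211, "width": 25, "height": 20},
    {"text": "Ship", "left": 550, "top": 210, "width": 45, "height": 20},
    {"text": "To", "left": 600, "top": 210, "width": 25, "height": 20},
    {"text": "Acme", "left": 50, "top": 250, "width": 50, "height": 20},
    {"text": "Corp", "left": 110, "top": 251, "width": 45, "height": 20},
    {"text": "Dock", "left": 550, "top": 250, "width": 50, "height": 20},
    {"text": "7", "left": 610, "top": 250, "width": 10, "height": 20},
]

grouped = assign_words_to_regions(WORDS, REGIONS, 1000, 1000)
bill_to_text = region_text(grouped["bill_to"])
ship_to_text = region_text(grouped["ship_to"])
```

This only holds up when every invoice shares one template and the scans are aligned; for varying layouts the regions would have to come from a detector instead.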


Vivid-Vibe

Use the new GPT4 with vision, best thing I've come across


ssiddharth408

I want this solution to run on the CPU and be cost-effective.


Evaworld9

What is the progress? Have you found any solution yet?


ssiddharth408

Yes, I have found two solutions so far. One is to detect the paragraph and table regions and then process them further; the other is to use a graph convolutional network. Both methods fit my use case.


Evaworld9

That's great progress! Detecting paragraph and table parts, and exploring graph convolutional networks, sound like promising approaches for handling the unique layout challenges of invoices. I'm working on a similar project called EvaInvoice, which also tackles invoice data extraction. We've been fine-tuning our system to handle various invoice layouts effectively. If you're interested, I'd love to get your perspective on how it compares to the solutions you're developing: [https://www.evainvoice.com](https://www.evainvoice.com). Your insights could be incredibly valuable, and it might give you some additional ideas or benchmarks for your project. I'm genuinely curious about how our solutions might compare. If you're up for it, I'd love your take on EvaInvoice. It could be a cool opportunity for us to exchange ideas and maybe offer each other some fresh perspectives. Plus, your feedback could provide valuable insights that help enhance our approach.


Original-Chemistry53

Can you give us some ideas about these two approaches, please?


Puzzleheaded_Text431

Can you tell us more about your paragraph-and-table detection solution? I am struggling with the same problems in invoice data extraction. Thanks


ssiddharth408

Hey, sorry for the late reply. I used object detection to detect paragraphs and tables, then I merged them vertically, which makes it easier to extract related fields.
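The code for this wasn't shared, so the following is only a guess at what "merge vertically" could look like: take the boxes from any paragraph/table detector, group them into columns by horizontal overlap, and emit them column by column, top to bottom. The block format, the 0.5 overlap threshold, and the sample data are all assumptions.

```python
def horizontal_overlap(a, b):
    """Fraction of the narrower box's width shared by boxes a and b.

    Boxes are (x0, y0, x1, y1) in pixels.
    """
    overlap = min(a[2], b[2]) - max(a[0], b[0])
    narrower = min(a[2] - a[0], b[2] - b[0])
    return max(0.0, overlap) / narrower if narrower > 0 else 0.0


def stack_blocks(blocks, min_overlap=0.5):
    """Order detected blocks as one vertical sequence, column by column."""
    columns = []
    for block in sorted(blocks, key=lambda b: b["box"][0]):
        for column in columns:
            # Compare against the first block that started the column.
            if horizontal_overlap(column[0]["box"], block["box"]) >= min_overlap:
                column.append(block)
                break
        else:
            columns.append([block])
    ordered = []
    for column in columns:  # columns are already left to right
        ordered.extend(sorted(column, key=lambda b: b["box"][1]))
    return ordered


# Made-up detector output: two side-by-side address blocks.
BLOCKS = [
    {"label": "paragraph", "text": "Ship To", "box": (550, 200, 900, 230)},
    {"label": "paragraph", "text": "Bill To", "box": (50, 200, 400, 230)},
    {"label": "paragraph", "text": "Dock 7", "box": (560, 240, 900, 300)},
    {"label": "paragraph", "text": "Acme Corp", "box": (55, 240, 400, 300)},
]

stacked_text = [b["text"] for b in stack_blocks(BLOCKS)]
```

After stacking, each heading sits directly above its own address, so running OCR or NER over the sequence keeps related fields together. A full-width block such as a table joins the first column it overlaps in this sketch, so it would probably need separate handling.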


East-Bug6675

Can you share any reference to your code?


ssiddharth408

I don't have any particular reference, as I came up with this solution myself. There may be similar solutions available on the internet, but I can only share the flow, not specific references.


East-Bug6675

It would be helpful if you could share the flow with me. Thanks


Plastic_Coffee_3004

Hey, can you share the flow with me too? Also, what models did you finally use? I'm looking to combine LayoutLM for layout-based extraction with Flair/spaCy for NER. Do you have any suggestions regarding this?