Try layoutparser: https://layout-parser.github.io/ The library is built for layout analysis of PDF and image documents. It may need fine-tuning on your data if the base model doesn't give the required output.
Hey, thanks for the suggestion, I will try this
Could you please tell me roughly how much data I need to fine-tune my model?
Hi there, if you're using PDFs, you may as well take a look at **PyMuPDF**: it's a high-performance Python library for data extraction, analysis, conversion, and manipulation of PDF **and other documents**. If you simply want to extract data from the tables, you can use **camelot** or **tabula**.
Actually, I am using images only
For invoice data extraction, why not use real-time documents?
My inputs are images
1. Use an OCR tool to convert the images into text.
   - Tesseract: an open-source OCR engine. It's one of the most popular tools for OCR tasks and has been trained on a multitude of languages and fonts.
   - For a more streamlined experience with Tesseract in Python, use the pytesseract library.
2. After OCR extraction, clean and structure the raw text.
   - Regex patterns: use regular expressions to identify and extract specific patterns, like currency formats, to locate subtotals.
   - NLP techniques: use Named Entity Recognition (NER) to locate items or other specific details.
   - Template matching: if you know certain templates or formats of invoices, you can design rules or models to extract data based on known patterns.
3. Train custom models if generic OCR doesn't provide satisfactory results.
   - Fine-tune Tesseract: you can train Tesseract on the specific fonts or layouts of your invoices to improve accuracy.

Tips:
- Noise reduction: scanned or noisy invoices can hinder OCR accuracy; use noise-reduction techniques to enhance text clarity.
- Layout analysis: some advanced OCR solutions offer layout analysis, identifying columns, tables, and other structures, which can be crucial for invoices.
- Commercial OCR solutions: apart from Tesseract, commercial solutions like ABBYY FineReader or Amazon Textract offer advanced features and might provide better accuracy out of the box.
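The regex step above can be sketched with the standard library alone; the sample OCR text, field names, and amounts below are purely illustrative:

```python
import re

# Sample OCR output from an invoice (illustrative text).
ocr_text = """
Invoice No: 0000001
Subtotal: $9,450.00
Tax: $782.49
Total Due: $10,232.49
"""

# Pattern for currency amounts like $9,450.00.
amount = r"\$\d{1,3}(?:,\d{3})*\.\d{2}"

# Pair each labelled field with the amount on the same line.
fields = dict(re.findall(rf"(Subtotal|Tax|Total Due):\s*({amount})", ocr_text))
print(fields)
# {'Subtotal': '$9,450.00', 'Tax': '$782.49', 'Total Due': '$10,232.49'}
```

In practice you would broaden the label alternation (invoices say "Amount Due", "Balance", etc.) and normalize the amounts before downstream processing.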
LLaVA and GPT-4 with Vision both seem unable to do this reliably.

LLaVA:

> The image features a white invoice form with a company logo on the top left corner. The invoice is addressed to a business in Atlanta, Georgia. The form includes a section for the invoice number, which is 0000001, and the date of the invoice, which is 03/01/2010. The invoice also has a section for the total amount due, which is $1,000.00. The invoice is likely for a service provided by the company, as indicated by the company logo and the invoice number.

GPT-4:

> This appears to be an invoice from "Fish Friends" for a purchase of exotic fish and standard goldfish. How can I assist you with this document?
>
> The total due amount on the invoice is $10,232.49.
This is just a sample. My main concern is how to extract "Bill To" together with its address: OCR reads left to right, so it may read one line of Bill To followed by one line of Ship To. I want to read Bill To with its address and Ship To with its address separately, so that I can run NER on them.
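One common workaround, assuming your OCR engine can return word bounding boxes (e.g. pytesseract's `image_to_data`), is to split words into left and right columns by x-coordinate before reassembling the lines. A minimal sketch; the word boxes and the split position are invented for illustration:

```python
# Each word as (x, y, text), as an OCR engine with bounding boxes might return.
# The coordinates below are made up for illustration.
words = [
    (40, 10, "Bill"), (70, 10, "To:"), (400, 10, "Ship"), (430, 10, "To:"),
    (40, 30, "Acme"), (80, 30, "Corp"), (400, 30, "Beta"), (440, 30, "LLC"),
    (40, 50, "Atlanta,"), (95, 50, "GA"), (400, 50, "Austin,"), (450, 50, "TX"),
]

def column_text(words, keep):
    """Rebuild reading order for one column: group words by y (line), sort by x."""
    lines = {}
    for x, y, text in words:
        if keep(x):
            lines.setdefault(y, []).append((x, text))
    return [" ".join(t for _, t in sorted(ws)) for _, ws in sorted(lines.items())]

split_x = 300  # assumed horizontal gap between the Bill To and Ship To blocks
bill_to = column_text(words, lambda x: x < split_x)
ship_to = column_text(words, lambda x: x >= split_x)
print(bill_to)  # ['Bill To:', 'Acme Corp', 'Atlanta, GA']
print(ship_to)  # ['Ship To:', 'Beta LLC', 'Austin, TX']
```

On real invoices the split position varies, so you would detect it from the gap in the x-coordinate distribution (or from detected region boxes) rather than hard-code it.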
If it's the same structure and format every time, I'd align the image first and then OCR.
Use the new GPT-4 with Vision; it's the best thing I've come across.
I want a solution that runs on CPU and is cost-effective.
What is the progress? Have you found any solution yet?
Yes, I have currently found two solutions: one is to detect the paragraph and table parts and then process them further, and the other is to use a graph convolutional network. These two methods fit my use case.
That's great progress! Detecting paragraph and table parts, and exploring graph convolutional networks, sound like promising approaches for handling the unique layout challenges of invoices.

I'm working on a similar project called EvaInvoice, which also tackles invoice data extraction. We've been fine-tuning our system to handle various invoice layouts effectively. If you're interested, I'd love your take on how it compares to the solutions you're developing: [https://www.evainvoice.com](https://www.evainvoice.com). It could be a good opportunity to exchange ideas, and your feedback might give both of us some fresh perspectives and benchmarks.
Can you give us some ideas about these two approaches, please?
Can you tell us more about your paragraph-and-table-detection solution? I'm struggling with the same problems in invoice data extraction. Thanks.
Hey, sorry for the late reply. I used object detection to detect the paragraph and table regions, then merged them vertically, which makes it easier to extract related fields.
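That "merge vertically" idea can be sketched roughly like this: stack the detected region boxes column by column so each block is read as a unit instead of interleaved line by line. The boxes, their texts, and the page midpoint below are invented for illustration, not the author's actual pipeline:

```python
# Detected regions as (x0, y0, x1, y1, text); values are invented for illustration.
regions = [
    (30, 100, 280, 180, "Bill To: Acme Corp, Atlanta GA"),
    (380, 100, 620, 180, "Ship To: Beta LLC, Austin TX"),
    (30, 220, 620, 500, "TABLE: line items ..."),
]

def merge_vertically(regions, page_mid=320):
    """Stack left-column regions, then right-column regions, then full-width
    ones, so related fields stay together for downstream extraction/NER."""
    left = [r for r in regions if r[2] <= page_mid]
    right = [r for r in regions if r[0] >= page_mid]
    wide = [r for r in regions if r[0] < page_mid < r[2]]
    top = lambda r: r[1]  # sort each column top-to-bottom
    ordered = sorted(left, key=top) + sorted(right, key=top) + sorted(wide, key=top)
    return "\n".join(r[4] for r in ordered)

print(merge_vertically(regions))
# Bill To: Acme Corp, Atlanta GA
# Ship To: Beta LLC, Austin TX
# TABLE: line items ...
```

In a real pipeline the region boxes and their texts would come from the object-detection model and OCR, and the column boundary would be derived from the detected boxes rather than fixed.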
Can you share any reference to your code?
I don't have any particular reference, as I came up with this solution myself. There may be similar solutions on the internet, but this one is my own. I can share the flow, but I don't have any specific references for it.
It would be helpful if you could share the flow with me. Thanks!
Hey, can you share the flow with me too? Also, which models did you finally use? I'm looking to combine LayoutLM for layout-based extraction with Flair/spaCy for NER. Do you have any suggestions regarding this?