Advanced Invoice Data Extraction System

Title: Invoice data extraction (PDF & noisy images) + 100% accuracy with human-in-the-loop review

Brief: I need a robust pipeline to extract structured data from supplier invoices that are not standardized (PDFs, scans, smartphone photos, sometimes skewed/low contrast). The system must reach 100% final accuracy via automatic extraction plus a lightweight human-review screen for uncertainties.

Data to extract (per document & per line item):
- Supplier data (legal name, address, VAT/Tax ID)
- Customer data (legal name, address, VAT/Tax ID)
- Product code (the real code on the invoice)
- Product description snippet (from the code’s line up to before the next product code, multi-line allowed)
- Quantity purchased
- Unit of measure
- Unit price
- Discounts (any format: %, chained like “10+10”, text notes)
- Line total (net)
- VAT code/rate per line

Quality & Review (mandatory):
- The model flags any low-confidence field.
- A review page shows: left = cropped snippet from the invoice (exact region), right = pre-filled field (editable) with Validate/Correct actions.
- After review, exported data must be clean JSON/CSV/Excel.

Input conditions:
- PDFs, multipage PDFs, photos (skewed, rotated, shadows).
- Mixed layouts, multiple columns, line wraps, product descriptions spanning lines, some items without lot/notes.
- Italian invoices mainly, but layout is vendor-specific (not a single template).

What I need from you:
- Approach: which stack/services (e.g., custom OCR + ML, Document AI/Rossum/AWS Textract + post-processing, open-source alternatives), how you’ll handle layout variance and chained discounts.
- Prototype: working PoC on my samples (both PDFs and phone images) with the review UI.
- Costing: your fixed/estimated fee for the PoC + optional phases to productionize; ongoing operating costs (SaaS/API pricing, pay-as-you-go vs subscription, expected cost per invoice at different volumes).
- Delivery: clean code, brief README, and instructions to run locally (Docker preferred) or on my VPS.
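As an illustration of the chained-discount requirement, here is a minimal Python sketch. It assumes that a chained discount such as “10+10” is applied sequentially (10% off, then a further 10% off the remainder, i.e. 19% overall rather than 20%); that interpretation, the function names, and the rounding rule are assumptions for illustration, not part of the brief. Free-text discount notes would still need to go to the review screen.

```python
import re
from decimal import Decimal, ROUND_HALF_UP


def discount_factor(text: str) -> Decimal:
    """Return the net price multiplier for a discount string.

    Every number found in the string is treated as a percentage and
    applied in sequence (assumption: chained discounts compound).
    Examples of accepted input: "10", "10+10", "10% + 5,5%".
    """
    factor = Decimal("1")
    for token in re.findall(r"\d+(?:[.,]\d+)?", text):
        pct = Decimal(token.replace(",", "."))  # Italian decimal comma
        factor *= (Decimal("100") - pct) / Decimal("100")
    return factor


def line_net_total(quantity: str, unit_price: str, discount: str) -> Decimal:
    """Compute quantity * unit price * discount factor, rounded to cents."""
    gross = Decimal(quantity) * Decimal(unit_price)
    net = gross * discount_factor(discount)
    return net.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    print(discount_factor("10+10"))              # net multiplier
    print(line_net_total("3", "12.50", "10+10"))  # net line total
```

Comparing the computed value with the line total printed on the invoice is one possible way to decide whether a line should be flagged for review.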
Success criteria:
- High auto-extraction accuracy; 100% final accuracy achieved via the review step.
- Correct parsing of multi-line descriptions and discount formats (including “10+10”).
- Stable on noisy images (deskew/denoise/contrast/dewarp included).

Nice to have (not mandatory):
- Field-level confidence scores; heatmaps for click-to-zoom on the original region.
- Vendor-specific learnings that improve over time.
- Italian language nuances.

Please include in your bid:
- Short description of your relevant experience.
- Proposed tools/services and why.
- Timeline estimate for PoC and a ballpark for total cost.
- Any clarifying questions + a list of sample files you need.
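To make the multi-line description criterion concrete, the sketch below groups OCR text lines into line items: a new item starts at each line that begins with a product code, and following lines are appended to its description until the next code appears. The code pattern `CODE_RE` is purely hypothetical (real product codes are vendor-specific), and the input is assumed to be the already-isolated text of the code/description columns, one string per printed line.

```python
import re
from dataclasses import dataclass, field

# Hypothetical pattern: an uppercase alphanumeric token of 4+ characters,
# containing at least one digit, at the start of the line (e.g. "AB-1234").
CODE_RE = re.compile(r"^(?=[A-Z0-9\-]*\d)[A-Z0-9][A-Z0-9\-]{3,}\b")


@dataclass
class LineItem:
    code: str
    description_lines: list = field(default_factory=list)

    @property
    def description(self) -> str:
        return " ".join(self.description_lines)


def group_line_items(lines):
    """Group text lines into items, one per detected product code."""
    items = []
    current = None
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # skip empty lines
        match = CODE_RE.match(text)
        if match:
            # A new product code starts a new item.
            current = LineItem(code=match.group(0))
            items.append(current)
            rest = text[match.end():].strip()
            if rest:
                current.description_lines.append(rest)
        elif current is not None:
            # Continuation line: belongs to the most recent item.
            current.description_lines.append(text)
    return items


if __name__ == "__main__":
    sample = [
        "AB-1234 Vite inox M6",
        "zincata, confezione da 100",
        "CD-5678 Dado M6",
        "Lotto 2231",
    ]
    for item in group_line_items(sample):
        print(item.code, "|", item.description)
```

A continuation line that happens to look like a code (for example one starting with a four-digit number) would be split incorrectly by this pattern, which is the kind of case the review step is meant to catch.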

PHP
