LLM Mixed PDF XML Converter

I need an application that can take batches of mixed PDFs (some purely text-based, others scanned images) and turn each file into a well-structured XML document that validates against XSD files I will supply. The core of the workflow should combine reliable OCR for scanned pages with a large-language-model stage that recognises headings, paragraphs, tables, figures and other logical components before writing them out in the schema-compliant order.

Key points to build into the solution
• One-click ingestion of individual files or whole folders of PDFs
• Automatic detection of whether a page needs OCR (Tesseract/Adobe/Google Vision or similar)
• LLM-driven structural analysis that maps the recognised content to the element hierarchy defined in my XSDs
• Real-time validation: the app must flag any nodes that fail schema checks before final export
• Clear logging so I can trace how each page was processed and why any element was mapped a certain way
• Simple configuration pane where I can add a different XSD without touching the code

Deliverables
1. Source code with readable comments (Python preferred, but I'm open to other stacks)
2. A command-line interface plus a minimal GUI/Streamlit panel for non-technical use
3. Unit tests and a small sample set showing successful conversion and XSD validation
4. Setup guide covering prerequisites, model keys, and deployment on Windows/Linux

Acceptance criteria
– All sample PDFs (both text and scanned) convert without manual edits and pass xml --schema using my XSDs
– Average page-level accuracy ≥ 95% on a blind test set I'll supply at the end
– Runtime under 60 s for a 30-page mixed document on a standard laptop

If you have prior experience blending OCR, NLP/LLMs (OpenAI, Claude, Llama-2, etc.) and schema-driven XML generation, this will be a straightforward project. Looking forward to seeing how you would architect, train and test the pipeline so that the output is rock-solid and maintainable.
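To make the "automatic detection of whether a page needs OCR" point more concrete, here is a minimal sketch of what I have in mind. It is only an illustration under assumptions: `PageInfo`, `needs_ocr`, the logger name and the `min_chars` threshold of 50 are all hypothetical, and the text layer and image count would have to come from whichever PDF library you choose. The log line is the kind of per-page trace I would like to see.

```python
import logging
from dataclasses import dataclass

# Hypothetical logger name; any naming scheme is fine.
logger = logging.getLogger("pdf2xml.routing")


@dataclass
class PageInfo:
    """Hypothetical per-page summary filled in by the chosen PDF library."""
    number: int       # 1-based page number
    text: str         # text layer extracted from the page ("" if none)
    image_count: int  # number of embedded raster images on the page


def needs_ocr(page: PageInfo, min_chars: int = 50) -> bool:
    """Route a page to OCR when its text layer is too thin and it holds images.

    The threshold is an assumed starting value, not a tuned one.
    """
    # Count only letters and digits so whitespace and stray marks are ignored.
    usable = sum(ch.isalnum() for ch in page.text)
    decision = usable < min_chars and page.image_count > 0
    logger.info(
        "page %d: %d usable chars, %d images -> %s",
        page.number, usable, page.image_count,
        "OCR" if decision else "text layer",
    )
    return decision


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    pages = [
        PageInfo(number=1, text="", image_count=1),         # scanned page
        PageInfo(number=2, text="A" * 200, image_count=0),  # digital text page
    ]
    for page in pages:
        needs_ocr(page)
```

A blank page (no text, no images) is deliberately not sent to OCR in this sketch; whether that is the right call for pages drawn with vector graphics is something I would expect you to decide and document.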
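The accuracy and runtime criteria could be checked with something as simple as the sketch below. It assumes each page has already been scored as a fraction between 0 and 1; how that score is computed is not fixed here, and `meets_acceptance` with its parameter names is a hypothetical helper, not an agreed interface.

```python
from statistics import mean
from typing import Sequence


def meets_acceptance(
    page_accuracies: Sequence[float],
    runtime_seconds: float,
    min_accuracy: float = 0.95,  # average page-level accuracy of at least 95%
    max_runtime: float = 60.0,   # strictly under 60 s per 30-page document
) -> bool:
    """Return True when both the accuracy and the runtime targets are met."""
    if not page_accuracies:
        raise ValueError("at least one page score is required")
    average = mean(page_accuracies)
    return average >= min_accuracy and runtime_seconds < max_runtime


if __name__ == "__main__":
    # Hypothetical scores for a three-page document processed in 42 s.
    print(meets_acceptance([0.96, 0.94, 0.98], runtime_seconds=42.0))
```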

Python
