I need a reliable way to pull two recurring tables—Table 3: Shareholders and Table 4: Directors & Officers—from every Y6 report filed by roughly 6,000 U.S. banks over the last 12 years. All source files are PDFs. The layout of these tables is generally consistent, but a few banks shift column order or add footnotes, so the workflow must tolerate minor structural changes without breaking.

So far I have experimented with NotebookLM on a handful of samples, and I can share that early code, sample outputs, and a small annotated training set on day one. I am open to whatever stack you prefer—Python, R, Power Query, Apache Tika, Tabula, Camelot, OCR tools such as Tesseract, or an entirely different AI pipeline—provided it reliably converts each table into accurate rows that drop straight into Excel (CSV or XLSX). Verifying the accuracy of the PDF-to-Excel transfer is crucial.

What I'm after:
• A reusable script or notebook that locates and extracts the two tables from every PDF in a batch folder
• A clean data file for each reporting year, with bank identifier, year, and all table fields preserved
• A log or flag file listing any PDFs that fail, with a quick note on why (e.g., scanned image, missing table, unusual format)
• A brief read-me so I can rerun the process as new reports arrive

I will hand over the existing code, sample PDFs, and our folder structure as soon as we start. If you can suggest smarter pattern recognition, templating, or post-extraction validation that reduces manual touch-ups, all the better—speed and repeatability matter more than a shiny UI.
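To illustrate the kind of tolerance for shifted column order I have in mind, here is a minimal sketch of fuzzy header matching using only the Python standard library. The canonical field names and alias lists below are illustrative assumptions, not the exact Y6 column labels; the idea is simply to map whatever headers the PDF extractor returns onto a fixed schema, and flag anything it cannot place.

```python
import difflib

# Canonical output schema for Table 3 (Shareholders).
# NOTE: these names and aliases are hypothetical placeholders,
# not the official Y6 column labels.
ALIASES = {
    "name": ["name", "shareholder name", "name & address"],
    "country": ["country", "country of citizenship", "citizenship"],
    "number_of_shares": ["number of shares", "shares held"],
    "percentage_of_ownership": ["percentage", "percent of ownership", "% ownership"],
}

def map_headers(extracted):
    """Map raw extracted headers to canonical fields, tolerating
    reordered columns and minor wording drift.

    Returns ({column_index: canonical_field}, [unmatched headers]);
    a non-empty unmatched list is a signal to log the PDF for review.
    """
    mapping, unmatched = {}, []
    for idx, raw in enumerate(extracted):
        key = raw.strip().lower()
        hit = None
        for field, variants in ALIASES.items():
            # Close-enough match (ratio >= 0.6) against known spellings.
            if difflib.get_close_matches(key, variants, n=1, cutoff=0.6):
                hit = field
                break
        if hit:
            mapping[idx] = hit
        else:
            unmatched.append(raw)
    return mapping, unmatched
```

For example, `map_headers(["Name & Address", "Country of Citizenship", "Number of Shares", "% Ownership"])` maps all four columns regardless of the order they appear in, while an unrecognized footnote column would land in the unmatched list and could be written to the failure log.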