Bulk PDF Body Text Extraction

Client: AI | Published: 19.10.2025

I have more than 10,000 PDFs that all follow the same internal structure. All I need right now is the body text pulled out of each file and saved as an individual plain-text file with a clear, consistent naming convention (e.g., matching the original PDF filename).

A lightweight, repeatable script (Python with pdfminer, PyPDF2, or any equivalent tool) is fine as long as it runs headless on a Linux server and copes with large batches without crashing. Headers, footers, and tables should be left out; I want just the main body copy, exactly as it appears in each document.

Deliverables
• A working script or command sequence, with all dependencies listed.
• A brief README explaining how to run it on my own machine.
• A sample output folder generated from 20 PDFs, so I can confirm the formatting before you run the full batch.

Clean, reliable extraction is the only goal for this phase, so please keep the solution simple and fast to deploy.
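To make the request concrete, here is a rough sketch of the kind of script I have in mind. It is not a finished solution. It assumes pdfminer.six is installed and that headers and footers sit in roughly the top and bottom 8% of each page; that margin, and every function name below, is a placeholder to be tuned against the real files. It does not attempt to filter out tables.

```python
import sys
from pathlib import Path

# Assumption: header and footer each occupy about 8% of the page height.
# Tune this against the real documents.
MARGIN_FRACTION = 0.08


def is_body_block(y0: float, y1: float, page_height: float,
                  margin: float = MARGIN_FRACTION) -> bool:
    """Return True if a text block lies fully between the top and bottom margins."""
    return y0 >= page_height * margin and y1 <= page_height * (1 - margin)


def output_path_for(pdf_path: Path, out_dir: Path) -> Path:
    """Map report.pdf to <out_dir>/report.txt so names match the source PDFs."""
    return out_dir / (pdf_path.stem + ".txt")


def extract_body_text(pdf_path: Path) -> str:
    """Collect the text blocks of one PDF that fall inside the body band."""
    # Imported lazily so the helpers above work without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    parts = []
    for page in extract_pages(str(pdf_path)):
        for element in page:
            if isinstance(element, LTTextContainer):
                _, y0, _, y1 = element.bbox
                if is_body_block(y0, y1, page.height):
                    parts.append(element.get_text())
    return "".join(parts)


def main(in_dir: str, out_dir: str) -> int:
    """Convert every *.pdf in in_dir; return the number of files that failed."""
    src, dst = Path(in_dir), Path(out_dir)
    dst.mkdir(parents=True, exist_ok=True)
    failures = 0
    for pdf_path in sorted(src.glob("*.pdf")):
        target = output_path_for(pdf_path, dst)
        if target.exists():
            continue  # already converted, so an interrupted run can be resumed
        try:
            target.write_text(extract_body_text(pdf_path), encoding="utf-8")
        except Exception as exc:  # one bad PDF should not stop the batch
            failures += 1
            print(f"FAILED {pdf_path.name}: {exc}", file=sys.stderr)
    return failures


if __name__ == "__main__":
    sys.exit(1 if main(sys.argv[1], sys.argv[2]) else 0)
```

Saved under a hypothetical name such as extract_body.py, it would be run as `python extract_body.py /path/to/pdfs /path/to/txt`. Skipping files that already have output and logging failures instead of aborting is what I mean by coping with large batches; the 20-file sample could be produced by pointing the same script at a folder containing only those PDFs.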