Bulk PDF Body Text Extraction

I have more than 10,000 PDFs that all follow the same internal structure. All I need right now is the body text pulled out of each file and saved as an individual plain-text file with a clear, consistent naming convention (e.g., matching the original PDF filename). A lightweight, repeatable script (Python with pdfminer, PyPDF2, or any equivalent tool) is fine as long as it runs headless on a Linux server and copes with large batches without crashing. Headers, footers, and tables should be left out; I only want the main body copy, exactly as it appears in each document.

Deliverables
• A working script or command sequence with any dependencies listed.
• A brief README explaining how to run it on my own machine.
• A sample output folder generated from 20 PDFs so I can confirm formatting before you run the full batch.

Clean, reliable extraction is the only goal for this phase, so please keep the solution simple and fast to deploy.
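To make the request concrete, here is a minimal sketch of the kind of script that would fit, assuming pdfminer.six is installed (`pip install pdfminer.six`). The function names (`output_path_for`, `strip_repeated_edges`, `read_pages`, `convert_folder`) and the 60% repetition threshold are illustrative assumptions, not requirements. The header/footer handling is a simple heuristic that drops the first or last line of a page when it repeats across most pages; it does not attempt to remove tables, which would need layout rules tuned to the actual files.

```python
import re
from collections import Counter
from pathlib import Path


def output_path_for(pdf_path, out_dir):
    """Naming convention: same filename as the PDF, with a .txt extension."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def _key(line):
    # Treat "Page 1" and "Page 2" as the same line when counting repeats.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_edges(pages, min_share=0.6):
    """Drop first/last page lines that repeat on most pages (assumed headers/footers)."""
    page_lines = [p.strip().splitlines() for p in pages]
    if len(page_lines) < 2:
        return ["\n".join(lines) for lines in page_lines]
    counts = Counter()
    for lines in page_lines:
        if lines:
            # A set, so a one-line page counts its line only once.
            counts.update({_key(lines[0]), _key(lines[-1])})
    threshold = min_share * len(page_lines)
    repeated = {k for k, n in counts.items() if n >= threshold}
    cleaned = []
    for lines in page_lines:
        if lines and _key(lines[0]) in repeated:
            lines = lines[1:]
        if lines and _key(lines[-1]) in repeated:
            lines = lines[:-1]
        cleaned.append("\n".join(lines).strip())
    return cleaned


def read_pages(pdf_path):
    """Return one text string per page, using pdfminer.six."""
    # Imported here so the helpers above work without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    pages = []
    for layout in extract_pages(str(pdf_path)):
        pages.append("".join(
            el.get_text() for el in layout if isinstance(el, LTTextContainer)
        ))
    return pages


def convert_folder(in_dir, out_dir, reader=read_pages):
    """Convert every *.pdf in in_dir; return a list of (filename, error) failures."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failures = []
    for pdf in sorted(Path(in_dir).glob("*.pdf")):
        target = output_path_for(pdf, out)
        if target.exists():
            continue  # already converted, so an interrupted run can be resumed
        try:
            pages = strip_repeated_edges(reader(pdf))
            target.write_text("\n\n".join(pages) + "\n", encoding="utf-8")
        except Exception as exc:  # one bad PDF should not stop the batch
            failures.append((pdf.name, repr(exc)))
    return failures
```

Calling `convert_folder("pdfs", "txt")` (both folder names are placeholders) would write one `.txt` per PDF and return the files that failed, so a 20-file sample can be checked before the full batch is run. Skipping existing outputs and collecting errors instead of raising is one way to meet the "without crashing" requirement.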

Python
