PDF Text Extraction to CSV

Budget: $25

I have a collection of PDFs (10k+) that hold structured text I need to analyse further. I'm looking for a clean way to pull that text out (no images, just the words) and have it delivered back to me in either CSV or JSON so I can drop it straight into my workflows.

Accuracy matters: the extracted content must match the originals closely enough that column headers, paragraph breaks and any tabular structure remain intact. With this many files involved, your solution needs to handle batch processing rather than a one-off conversion. Feel free to rely on the usual Python tool-chain (pdfminer.six, PyPDF2, Tika, Tabula, Camelot, etc.) or any alternative stack you're comfortable with, so long as the final output is reliable and repeatable.

Deliverables
• A working script or small utility I can rerun on future PDFs
• One sample CSV or JSON file showing the correctly extracted data
• A brief read-me explaining prerequisites and usage steps

I'm ready to provide a few representative PDFs as soon as we get started.
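To make the batch requirement concrete, here is a rough sketch of the kind of script I have in mind. It is not a finished solution. It assumes pdfminer.six as the extractor (any of the tools above could be swapped in), and the names `batch_extract` and `extract_with_pdfminer` and the `filename, page, text` column layout are my own placeholders. It also assumes pdfminer's text output separates pages with a form feed character, and it does not attempt table reconstruction, which would likely need Camelot or Tabula.

```python
import csv
from pathlib import Path


def extract_with_pdfminer(pdf_path):
    """Return the plain text of one PDF using pdfminer.six (assumed installed)."""
    # Imported lazily so the batch logic can be used with another extractor.
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))


def batch_extract(pdf_dir, out_csv, extractor=extract_with_pdfminer):
    """Walk pdf_dir recursively and write one CSV row per non-empty page.

    Returns (number_of_pdfs_processed, list_of_(filename, error) failures).
    """
    failures = []
    processed = 0
    with open(out_csv, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["filename", "page", "text"])
        for pdf_path in sorted(Path(pdf_dir).rglob("*.pdf")):
            try:
                text = extractor(pdf_path)
            except Exception as exc:
                # One bad file should not stop a 10k-file run; log and move on.
                failures.append((pdf_path.name, str(exc)))
                continue
            # Assumption: pages are delimited by form feed ("\f") in the output.
            for page_no, page_text in enumerate(text.split("\f"), start=1):
                if page_text.strip():
                    # csv.writer quotes embedded newlines, so paragraph
                    # breaks inside a page survive the round trip.
                    writer.writerow([pdf_path.name, page_no, page_text])
            processed += 1
    return processed, failures
```

Something like `batch_extract("pdfs/", "output.csv")` would then be rerunnable on future batches, and the returned failure list shows which files need a second look.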

Python
