PDF Voter Data Extract Script

I have a collection of voter-list PDFs that combine searchable text with embedded scanned images. I need a small, well-commented script that can read those mixed-content files, capture every piece of data that appears on each voter entry, and export it to a clean, single-sheet Excel workbook.

Key points
• Source: multipage voter PDFs containing both text layers and image-only sections.
• Data scope: every available field (names, addresses, voter IDs, birthdates, gender, polling details); anything present on the page should be captured.
• Output: a standard spreadsheet (.xlsx), one row per voter, plain column names generated automatically by the script. No custom template necessary.

What I expect from you
1. A Python script (preferred) that combines PDF text extraction with OCR for image zones. Feel free to leverage pdfplumber, PyPDF2, Camelot, tabula-py, or Tesseract, whatever achieves reliable results.
2. Clear comments explaining each step and any external dependencies.
3. A short README showing setup, command-line usage, and how to point the script at a folder of PDFs.
4. A sample run on at least one PDF I provide, plus the resulting Excel file to verify correct field mapping.

The script should handle typical voter-list quirks such as multi-line addresses, occasional blank fields, and non-English characters. If any data point cannot be confidently parsed, log it so I can review later.

Delivery timeline is flexible within a few days; quality and clarity are the priority.
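To make the requested pipeline concrete, here is a rough sketch of one possible shape for it, not a finished deliverable. It assumes pdfplumber for the text layer, pytesseract (with the Tesseract binary installed) as the OCR fallback, and pandas with openpyxl for the .xlsx export. The layout assumptions are hypothetical: each field is printed as "Label: value", entries are separated by blank lines, and lines without a label continue the previous field. The function names (page_text, split_entries, parse_entry, run) are illustrative only.

```python
import logging
import re
from pathlib import Path

log = logging.getLogger("voter_extract")

# Assumption: each field appears as "Label: value" on its own line.
LABEL_RE = re.compile(r"^([^:]{1,40}?)\s*:\s*(.*)$")


def page_text(page, min_chars=20, ocr_lang="eng"):
    """Return the page's text layer, or OCR the page if the layer is near-empty."""
    text = page.extract_text() or ""
    if len(text.strip()) >= min_chars:
        return text
    import pytesseract  # imported lazily; requires the Tesseract binary

    image = page.to_image(resolution=300).original  # PIL image of the page
    return pytesseract.image_to_string(image, lang=ocr_lang)


def split_entries(text):
    """Split page text into voter blocks (assumed to be blank-line separated)."""
    return [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]


def parse_entry(block):
    """Turn one voter block into a dict; column names come from the labels."""
    record = {}
    current = None
    for raw in block.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LABEL_RE.match(line)
        if match:
            current = match.group(1).strip().lower().replace(" ", "_")
            record[current] = match.group(2).strip()  # may be blank
        elif current is not None:
            # Unlabelled line: treat it as a continuation (multi-line address).
            record[current] = (record[current] + " " + line).strip()
        else:
            # Nothing to attach it to: log it and keep it for manual review.
            log.warning("Unparsed line: %r", line)
            record["unparsed"] = (record.get("unparsed", "") + " " + line).strip()
    return record


def run(folder, out_path):
    """Process every PDF in a folder and write one row per voter to out_path."""
    import pdfplumber
    import pandas as pd  # DataFrame.to_excel needs openpyxl for .xlsx

    rows = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        with pdfplumber.open(pdf_path) as pdf:
            for number, page in enumerate(pdf.pages, start=1):
                for block in split_entries(page_text(page)):
                    record = parse_entry(block)
                    record["source_file"] = pdf_path.name
                    record["source_page"] = number
                    rows.append(record)
    pd.DataFrame(rows).to_excel(out_path, index=False)
    return len(rows)
```

Calling run("pdfs", "voters.xlsx") would process a folder named pdfs (a placeholder). Real voter lists are often laid out in columns or boxes, so the blank-line and "Label: value" assumptions would likely need replacing with layout-aware parsing, and an address line that itself contains a colon would be misread as a new field. For non-English text, ocr_lang would have to name an installed Tesseract language pack.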

Python
