Scanned PDF OCR to MySQL

Client: AI | Published: 15.02.2026

Remember: this must work in an offline environment.

I have a batch of English-language PDFs that were scanned as images. Each file contains a mix of cleanly typed passages and more challenging handwritten notes in the margins. I need every legible word pulled out with the highest accuracy you can achieve and stored directly in a MySQL database, not as flat files.

Accuracy matters more than speed. Feel free to combine engines such as Tesseract, Google Vision, or AWS Textract; use whatever blend gives you the best recognition rate on both printed and cursive text. Pre-processing for skew, noise, and contrast is expected so that the handwriting is captured as reliably as the typed sections.

The database is already provisioned. I will share connection details and a simple schema suggestion (doc_id, page_no, original_block, extracted_text, confidence_score). If you would rather propose a better structure, I'm open to it as long as each text block can be traced back to its page and position.

Deliverables
• A script or small application (Python, Java, or PHP are all fine) that ingests each PDF, performs OCR, and inserts the results into MySQL.
• An SQL dump or migration file that recreates any additional tables you introduce.
• A brief README explaining setup, dependencies, and how to rerun the process on future files.
• A sample run on three provided PDFs demonstrating the expected accuracy and table population.

I'll test by spot-checking handwritten lines and running keyword searches across the stored text. Payment is released once the sample set passes those checks and the code runs cleanly on my machine.
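To make the suggested schema concrete, here is a minimal Python sketch of how word-level OCR output could be grouped into one row per text block. It is an illustration under assumptions, not a required design: the input dictionary is assumed to have the shape that pytesseract's `image_to_data(..., output_type=Output.DICT)` returns (parallel lists under `text`, `conf`, `block_num`, `left`, `top`, `width`, `height`), the table name `ocr_blocks` is hypothetical, and encoding the position as a string in `original_block` is just one possible way to keep each block traceable to its place on the page. The helper names `blocks_to_rows` and `store_rows` are also made up for this example.

```python
from collections import defaultdict

# Hypothetical table name; the columns follow the schema suggested in the posting.
INSERT_SQL = (
    "INSERT INTO ocr_blocks "
    "(doc_id, page_no, original_block, extracted_text, confidence_score) "
    "VALUES (%s, %s, %s, %s, %s)"
)


def blocks_to_rows(doc_id, page_no, data):
    """Group word-level OCR output into one row per text block.

    `data` is assumed to be a dict of parallel lists, in the shape produced by
    pytesseract's image_to_data with output_type=Output.DICT.
    """
    words_by_block = defaultdict(list)
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])
        # Tesseract reports conf = -1 for layout entries that are not words.
        if conf < 0 or not text.strip():
            continue
        words_by_block[data["block_num"][i]].append(i)

    rows = []
    for block_num in sorted(words_by_block):
        idx = words_by_block[block_num]
        # Bounding box that encloses every word in the block.
        left = min(data["left"][i] for i in idx)
        top = min(data["top"][i] for i in idx)
        right = max(data["left"][i] + data["width"][i] for i in idx)
        bottom = max(data["top"][i] + data["height"][i] for i in idx)
        position = (
            f"block={block_num};x={left};y={top};"
            f"w={right - left};h={bottom - top}"
        )
        text = " ".join(data["text"][i].strip() for i in idx)
        mean_conf = round(sum(float(data["conf"][i]) for i in idx) / len(idx), 2)
        rows.append((doc_id, page_no, position, text, mean_conf))
    return rows


def store_rows(cursor, rows):
    """Insert rows through any DB-API cursor that uses %s placeholders."""
    cursor.executemany(INSERT_SQL, rows)
```

Averaging word confidences per block keeps low-confidence handwritten blocks easy to find for spot checks, for example by querying for rows below a chosen confidence_score. The `%s` placeholder style matches common MySQL drivers for Python such as mysql-connector-python and PyMySQL; committing the transaction is left to the caller.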