Advanced OCR Data Extraction from Scanned Forms

Title: OCR-Based Data Extraction System for Semi-Structured Forms (Python)

Description: I am looking for an experienced Python developer to build a high-accuracy OCR-based data extraction system for semi-structured scanned documents.

The system should be capable of extracting clean and readable text from noisy images where:
* Text may be misaligned
* Words may be merged (e.g., teusday11-27-1962)
* OCR errors are common
* Multiple data fields appear in a single line

Core Requirements:

1. OCR Engine:
   * Use PaddleOCR (preferred) or any advanced OCR (not basic Tesseract unless improved)

2. Image Preprocessing:
   * Grayscale conversion
   * Noise removal (Gaussian blur)
   * Adaptive thresholding
   * Deskewing (if needed)

3. Text Extraction:
   * Extract text along with bounding box coordinates
   * Maintain high accuracy even with poor-quality scans

4. Text Reconstruction:
   * Group text into proper lines using positional data (Y-axis grouping)
   * Sort words left-to-right (X-axis sorting)
   * Rebuild readable sentences

5. Text Cleaning:
   * Fix OCR errors (e.g., 0/O, 1/I, spelling corrections)
   * Add missing spaces between words and numbers

6. Output:
   * Clean, structured, multi-line readable text
   * No need for full automation of field mapping (manual selection is acceptable)

7. Code Quality:
   * Modular Python code
   * Functions like:
      * preprocess_image()
      * extract_text()
      * reconstruct_lines()
      * clean_text()

Nice to Have:
* Experience with document AI tools like Google Vision API or Amazon Textract
* Handling of forms with 15–25 fields

Goal: The final system should significantly improve extraction accuracy compared to basic OCR and produce clean, human-readable output ready for manual use.

Deliverables:
* Fully working Python code
* Sample test results on provided images
* Documentation on how to run the system

Budget & Timeline: (Open for discussion)

Contact: +91 9309835085
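
To illustrate what is meant by requirements 4 and 5 (Text Reconstruction and Text Cleaning), here is a rough sketch of reconstruct_lines() and clean_text() using only the Python standard library. It is a starting point under stated assumptions, not the expected final implementation. The `Word` tuple (text, left x, top y, box height) is a hypothetical input shape; converting the OCR engine's actual output into it is not shown. The `y_tolerance` value and the character-confusion rules are likewise illustrative guesses, and spelling correction is left out.

```python
import re
from typing import List, NamedTuple


class Word(NamedTuple):
    """Hypothetical per-word OCR result: text plus top-left corner and box height."""
    text: str
    x: float
    y: float
    height: float


def _centre(word: Word) -> float:
    """Vertical centre of a word's bounding box."""
    return word.y + word.height / 2


def reconstruct_lines(words: List[Word], y_tolerance: float = 0.5) -> List[str]:
    """Group words into lines by vertical centre (Y-axis grouping),
    then sort each line left to right (X-axis sorting)."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=_centre):
        if lines:
            current = lines[-1]
            mean_centre = sum(_centre(w) for w in current) / len(current)
            mean_height = sum(w.height for w in current) / len(current)
            # Same line if the centre is within a fraction of the line height.
            if abs(_centre(word) - mean_centre) <= y_tolerance * mean_height:
                current.append(word)
                continue
        lines.append([word])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


# Letters that OCR commonly confuses with digits (assumed set, extend as needed).
_CONFUSABLE = str.maketrans("OoIl", "0011")


def _to_digits(match: "re.Match[str]") -> str:
    return match.group(0).translate(_CONFUSABLE)


def clean_text(line: str) -> str:
    """Fix O/0 and I/1 confusions next to digits, then add missing spaces
    between letters and numbers."""
    # Confusable letters right after a digit and before a digit or word end: 1O-27 -> 10-27
    line = re.sub(r"(?<=\d)[OoIl]+(?=\d|\b)", _to_digits, line)
    # Confusable letters at the start of a token, directly followed by a digit: I9 -> 19
    line = re.sub(r"\b[OoIl]+(?=\d)", _to_digits, line)
    # Insert a space at every letter/digit boundary: teusday10 -> teusday 10
    line = re.sub(r"(?<=[A-Za-z])(?=\d)", " ", line)
    line = re.sub(r"(?<=\d)(?=[A-Za-z])", " ", line)
    return re.sub(r"\s+", " ", line).strip()


if __name__ == "__main__":
    sample = [
        Word("Name:", 10, 100, 20),
        Word("DOB:", 300, 104, 20),
        Word("John", 80, 98, 20),
        Word("teusday1O-27-1962", 360, 102, 20),
        Word("Address:", 10, 150, 20),
    ]
    for raw_line in reconstruct_lines(sample):
        print(clean_text(raw_line))
```

The letter/digit split is deliberately blunt: it would also break apart legitimate codes such as "12B", so a real solution will likely need per-field rules or a whitelist.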

Python
