Extend OCR for Telugu Voter Lists

I have a working Python pipeline that already pulls clean text from scanned voter lists in Tamil and English by combining custom image pre-processing, a light AI layer, and Tesseract OCR. The next milestone is to make the same pipeline read Telugu with comparable performance. My target is 99% character-level accuracy across the entire page, not just names or voter IDs. Once Telugu is solid, we will roll the same approach out to the rest of the major Indian scripts (Hindi, Bengali, Marathi, Malayalam, Kannada, Assamese, Gujarati, Punjabi and Odia), but this job is strictly about nailing Telugu first.

What you’ll work with
• Current codebase (Python, OpenCV, pytesseract, a few custom TensorFlow helpers)
• A curated set of high-resolution scanned PDFs and images of Telangana and Andhra Pradesh voter rolls for training / validation
• My existing language-agnostic pre- and post-processing modules, which you are free to tweak

Key responsibilities
1. Train or fine-tune a Tesseract language data set (or an alternative open-source OCR engine if it yields better accuracy) for printed Telugu voter-list fonts.
2. Integrate the new language file into the existing code, keeping the same API and CLI behaviours.
3. Validate against my test suite and push accuracy to ≥99% on a per-character basis; document any edge-case failures and patches.
4. Hand over updated code, trained data files, and a concise technical note explaining changes and future-language scaling steps.

Acceptance criteria
• ≥99% per-character accuracy on the provided blind test batch
• Same or faster processing speed than the current Tamil run
• The Telugu code will be a separate version; the same code need not read Tamil

House number accuracy is extremely important.

I will prioritise freelancers who can point me to prior OCR/Tesseract projects in Indian scripts and explain, in a few lines, how they usually drive accuracy past the 95% mark.
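To make the ≥99% per-character target concrete, here is a minimal sketch of one common way to score it: 1 minus the edit distance between the OCR output and the ground truth, divided by the ground-truth length. This is an assumption for discussion, not necessarily how my test suite computes it; the function names `levenshtein` and `char_accuracy` are illustrative. Note that it counts Unicode code points after NFC normalisation, so one Telugu syllable made of a consonant plus a vowel sign counts as several characters.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, counted in Unicode code points."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Assumed metric: 1 - (edit distance / reference length), floored at 0."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", ocr_output)
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - levenshtein(ref, hyp) / len(ref))


if __name__ == "__main__":
    # Hypothetical house-number field with one wrong digit.
    print(char_accuracy("12-3-45/A", "12-3-46/A"))
```

Short fields such as house numbers are punished heavily by this metric (one wrong digit in nine characters is already far below 99%), so scoring them separately from the full-page figure may be worth agreeing on up front.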

Python
