Cross-Format Text Comparison Tool

I need a small yet reliable program that can read the full contents of a Word document containing complex layouts (text, images, tables) and compare it against the corresponding content found in either an HTML page or a PDF file that is equally rich in formatting. The purpose is to generate a clear, detailed comparison report that tells me where wording diverges, which tables differ, and whether any images or captions have changed.

You are free to choose the language and libraries you find most efficient (Python with python-docx, BeautifulSoup4, pdfminer or PyPDF2 is perfectly fine; a C# or Java solution using Apache POI, iText, etc. is equally acceptable). What matters is that the script:

• Extracts all textual segments in reading order from both sources, even when they are embedded in tables, text boxes or figure captions.
• Ignores purely stylistic discrepancies unless they influence meaning.
• Outputs a human-readable report (Markdown, HTML, or an annotated DOCX are all acceptable) summarising identical blocks, modified blocks, additions and deletions.

When finished, please hand over the runnable code, a concise README that shows how to install dependencies and execute the comparison, and one sample report generated from your test files so I can see the format immediately.
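To make the expected behaviour concrete, here is a minimal sketch of the comparison and reporting step only, using Python's standard library. It assumes the text blocks have already been extracted in reading order by whatever library is chosen (extraction is not shown), and the function names `normalize`, `compare_blocks` and `to_markdown` are hypothetical, not part of any existing tool. Treating whitespace, curly quotes and dash variants as "purely stylistic" is also an assumption that may need adjusting.

```python
import difflib
import re
import unicodedata

# Assumption: these characters differ only stylistically from their ASCII forms.
_STYLE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
})


def normalize(block):
    """Reduce a text block to a form where only wording differences remain."""
    text = unicodedata.normalize("NFKC", block).translate(_STYLE_MAP)
    return re.sub(r"\s+", " ", text).strip()


def compare_blocks(docx_blocks, other_blocks):
    """Return (status, old, new) tuples for two lists of extracted blocks."""
    a = [normalize(x) for x in docx_blocks]
    b = [normalize(x) for x in other_blocks]
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    results = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old, new = a[i1:i2], b[j1:j2]
        if tag == "equal":
            results += [("identical", x, x) for x in old]
        elif tag == "delete":
            results += [("deleted", x, "") for x in old]
        elif tag == "insert":
            results += [("added", "", y) for y in new]
        else:  # "replace": pair blocks up, the remainder is deleted or added
            paired = min(len(old), len(new))
            results += [("modified", o, n) for o, n in zip(old, new)]
            results += [("deleted", o, "") for o in old[paired:]]
            results += [("added", "", n) for n in new[paired:]]
    return results


def to_markdown(results):
    """Render the comparison as a short Markdown report."""
    lines = ["# Comparison report", ""]
    for status in ("identical", "modified", "added", "deleted"):
        count = sum(1 for s, _, _ in results if s == status)
        lines.append(f"- {status}: {count}")
    lines.append("")
    for status, old, new in results:
        if status == "modified":
            lines += ["## Modified", f"- Word: {old}", f"- Other: {new}", ""]
        elif status == "added":
            lines += ["## Added", f"- {new}", ""]
        elif status == "deleted":
            lines += ["## Deleted", f"- {old}", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    word = ["Intro   text", "Price is \u201c10\u201d", "Closing note"]
    html = ["Intro text", 'Price is "12"', "Closing\u00a0note", "Appendix"]
    print(to_markdown(compare_blocks(word, html)))
```

Comparing whole blocks rather than characters keeps the report readable; a real solution would likely add a word-level diff inside each modified block and separate handling for tables and images.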
