Word Docs to Structured JSON

Client: AI | Published: 11.11.2025

I have a collection of Microsoft Word documents (.docx) that I need converted into clean, well-structured JSON files. The only element I care about pulling from each file is the text content; images, metadata, and any other embedded objects can be ignored.

You're free to choose the stack you're most comfortable with (Python with docx4python / python-docx or similar, Node.js, Pandoc, or even a lightweight CLI utility), as long as the end result is accurate and reproducible. My main priority is that every word in the original document appears in the JSON output in the same order as in the source file. If you have a preferred way to break the text into logical chunks (paragraphs, headings, etc.), let's discuss it and agree on the structure before you start coding.

Deliverables
• A script or small app that takes one or more .docx files as input and outputs a corresponding .json file for each.
• The generated JSON for a short test set of documents I'll provide.
• Brief instructions so I can run the process myself on future files (command-line or GUI is fine).

Once I confirm the JSON matches the source content and the tool runs on my machine without extra dependencies, the job is done.
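For discussion, here is a minimal sketch of what such a script could look like. It assumes Python with only the standard library (one way to meet the "no extra dependencies" condition) and a paragraph-level output structure. The function names and the JSON fields `index`, `style`, and `text` are illustrative assumptions, not an agreed format. It reads `word/document.xml` directly from the .docx archive and keeps paragraphs in document order; headers, footers, footnotes, and text boxes are not handled here.

```python
import json
import sys
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used by elements in word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_records(path):
    """Return the non-empty body paragraphs of a .docx, in document order."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    records = []
    for para in root.iter(W + "p"):
        parts = []
        # Only look inside runs, so tab-stop definitions in paragraph
        # properties are not mistaken for tab characters.
        for run in para.iter(W + "r"):
            for node in run:
                if node.tag == W + "t":
                    parts.append(node.text or "")
                elif node.tag == W + "tab":
                    parts.append("\t")
                elif node.tag in (W + "br", W + "cr"):
                    parts.append("\n")
        text = "".join(parts)
        if not text.strip():
            continue  # skip empty paragraphs

        style = para.find(f"{W}pPr/{W}pStyle")
        records.append({
            "index": len(records),  # position among kept paragraphs
            "style": style.get(W + "val") if style is not None else None,
            "text": text,
        })
    return records


def convert(path):
    """Write <name>.json next to <name>.docx and return the output path."""
    src = Path(path)
    out = src.with_suffix(".json")
    payload = {"source": src.name, "paragraphs": docx_to_records(src)}
    out.write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return out


if __name__ == "__main__":
    # Each command-line argument is treated as a path to a .docx file.
    for arg in sys.argv[1:]:
        print(convert(arg))
```

Saved under a hypothetical name such as `docx_to_json.py`, it would be run as `python docx_to_json.py report.docx notes.docx`, producing `report.json` and `notes.json` alongside the inputs. The `style` field carries the paragraph style ID from the document (for example a heading style), which could serve as a starting point for deciding how to chunk the text.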