I need a concise script or repeatable workflow that scans a batch of academic papers (PDF, DOCX, or plain text) and reports how often sentences fall into two buckets: short sentences containing fewer than 10 words and long sentences containing more than 20 words. Everything in between can be ignored or grouped as "mid-range"; the focus is squarely on the extremes.

Here is what I expect:

• Input: a folder of research articles that may vary in format.
• Processing: automatic extraction of the body text and measurement of every sentence's word count.
• Output: a clear summary, per document and aggregated, showing counts and percentages of short vs. long sentences, plus optional CSV export and simple visualisations (bar chart or histogram) to make patterns obvious.

Python with NLTK, spaCy, or a comparable NLP toolkit is perfectly fine, as long as the code is clean, commented, and runnable on a standard machine. I would also like a brief README explaining how to add more files, change the sentence-length thresholds in future, and interpret the results.

Once delivered, I will run the script on a small sample set; if the numbers line up with a quick manual check, the task is complete.
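To make the expected behaviour concrete, here is a minimal sketch of the core counting logic. It is not the deliverable itself: it uses a naive regex sentence splitter as a stand-in for NLTK's sent_tokenize or spaCy, skips the PDF/DOCX extraction step entirely, and the names SHORT_LIMIT, LONG_LIMIT, classify, and summarize are illustrative choices, with the 10- and 20-word thresholds taken from the requirements above.

```python
import re
from collections import Counter

SHORT_LIMIT = 10  # fewer than 10 words counts as "short"
LONG_LIMIT = 20   # more than 20 words counts as "long"

def split_sentences(text):
    """Naive splitter on ., !, or ? followed by whitespace.
    A real pipeline would substitute NLTK or spaCy here."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def classify(sentence):
    """Bucket a sentence by its whitespace-delimited word count."""
    n = len(sentence.split())
    if n < SHORT_LIMIT:
        return "short"
    if n > LONG_LIMIT:
        return "long"
    return "mid"

def summarize(text):
    """Return count and percentage per bucket for one document's text."""
    counts = Counter(classify(s) for s in split_sentences(text))
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {bucket: {"count": counts.get(bucket, 0),
                     "pct": 100.0 * counts.get(bucket, 0) / total}
            for bucket in ("short", "mid", "long")}
```

Per-document dictionaries like the one summarize returns can then be written to CSV and merged into the aggregate report; swapping the thresholds later only means editing the two constants at the top.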