I have a bilingual dataset of 1,000 sentences that is already translated. What I need now is a clear, defensible linguistic-quality benchmark for those translations.

Here's the core of the job:
• Run a linguistic quality evaluation and report the results as ROUGE scores.
• Supply a concise report (tables + brief narrative) that explains what the scores mean for overall quality and highlights any outliers that deserve human review.
• Share the clean, reproducible code you use; Python notebooks or scripts are fine as long as they run end-to-end on my side with no setup beyond standard packages (e.g., SacreROUGE, NLTK, or Hugging Face tools).

I will provide:
– The 1,000 source sentences
– Their current translations (reference)

You return:
1. The ROUGE evaluation script/notebook
2. The scored output file for all 1,000 lines
3. A short README or inline comments so I can rerun or adapt the workflow later

This is a self-contained task; once I verify that the script reproduces your numbers and the explanatory note is clear, we are done.
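
For orientation, here is a minimal sketch of the kind of workflow described above. It scores each candidate translation against its line-aligned reference with the rouge-score package (one common option alongside the tools named in the brief), writes a per-line TSV, and flags low-scoring lines for human review. The file names, the outlier threshold, and the choice of library are illustrative assumptions, not requirements from the brief.

    """
    Minimal per-line ROUGE scoring sketch (assumptions, not the final deliverable):
    - candidates.txt : one candidate translation per line (the text being evaluated)
    - references.txt : one reference translation per line (the client's current translations)
    - Both files are UTF-8, line-aligned, 1,000 lines each.
    Requires: pip install rouge-score
    """
    import csv

    from rouge_score import rouge_scorer

    CANDIDATE_FILE = "candidates.txt"   # hypothetical path
    REFERENCE_FILE = "references.txt"   # hypothetical path
    OUTPUT_FILE = "rouge_scores.tsv"    # hypothetical path
    OUTLIER_THRESHOLD = 0.30            # arbitrary ROUGE-L F1 cut-off for flagging lines


    def read_lines(path):
        with open(path, encoding="utf-8") as fh:
            return [line.strip() for line in fh]


    def main():
        candidates = read_lines(CANDIDATE_FILE)
        references = read_lines(REFERENCE_FILE)
        assert len(candidates) == len(references), "Input files must be line-aligned"

        scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

        rows, flagged = [], []
        for idx, (cand, ref) in enumerate(zip(candidates, references), start=1):
            scores = scorer.score(ref, cand)  # signature is score(target, prediction)
            row = {
                "line": idx,
                "rouge1_f": round(scores["rouge1"].fmeasure, 4),
                "rouge2_f": round(scores["rouge2"].fmeasure, 4),
                "rougeL_f": round(scores["rougeL"].fmeasure, 4),
            }
            rows.append(row)
            if row["rougeL_f"] < OUTLIER_THRESHOLD:
                flagged.append(idx)

        # Write the per-line scored output file.
        with open(OUTPUT_FILE, "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=rows[0].keys(), delimiter="\t")
            writer.writeheader()
            writer.writerows(rows)

        mean_l = sum(r["rougeL_f"] for r in rows) / len(rows)
        print(f"Scored {len(rows)} lines; mean ROUGE-L F1 = {mean_l:.4f}")
        print(f"{len(flagged)} lines below {OUTLIER_THRESHOLD} ROUGE-L F1 flagged for review: {flagged[:20]}")


    if __name__ == "__main__":
        main()

The per-line TSV produced here is the sort of scored output file item 2 asks for, and the summary statistics printed at the end can seed the tables and narrative in the report; the actual deliverable should document whichever ROUGE variants and thresholds are agreed on.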