I have a text-based dataset that needs a focused round of cleaning: I want every duplicate record removed while all unique entries are left untouched. The file is already compiled; it simply contains repeated lines that crept in during previous merges.

Here is what I expect:

• You load the file (it’s in CSV; I can convert to TXT or Excel if that suits your workflow).
• Run a systematic, case-insensitive check for exact and near-exact text duplicates.
• Return two outputs:
  1. A cleaned file containing only unique rows, with the original column order and encoding preserved.
  2. A short log or report summarizing how many duplicates you identified and removed, with sample lines so I can spot-check your work.

Any tool is fine, whether Python (pandas), Excel, OpenRefine, or a favourite script, as long as the result is accurate and I can reproduce it if needed. Turnaround is flexible, but faster is better; let me know your ETA when you respond.
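For reference, the exact-duplicate pass described above can be sketched in standard-library Python. This is a minimal illustration, not a deliverable: the function name `dedupe_csv` and the sample size of 5 are my own choices, it handles only exact case-insensitive matches (near-duplicate detection would need fuzzy matching on top), and a real solution must also preserve the source file's encoding when writing the cleaned output.

```python
import csv
import io

def dedupe_csv(text, sample_limit=5):
    """Remove case-insensitive duplicate rows from CSV text.

    Returns (cleaned_rows, report): cleaned_rows keeps the header and the
    first occurrence of each row in original column order; report counts
    duplicates removed and retains a few sample lines for spot-checking.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    seen = set()
    unique, samples = [], []
    for row in data:
        # Normalize each cell (trim whitespace, lowercase) to compare
        # rows case-insensitively without altering the kept originals.
        key = tuple(cell.strip().lower() for cell in row)
        if key in seen:
            if len(samples) < sample_limit:
                samples.append(row)
        else:
            seen.add(key)
            unique.append(row)
    report = {
        "total_rows": len(data),
        "unique_rows": len(unique),
        "duplicates_removed": len(data) - len(unique),
        "sample_duplicates": samples,
    }
    return [header] + unique, report
```

For example, `dedupe_csv("id,name\n1,Alice\n2,Bob\n1,alice\n")` would report one duplicate removed and keep the rows for Alice and Bob.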