STRICT FILTER: Only candidates with proven expertise will receive a response. You must showcase prior projects or positive reviews in Entity Resolution, Data Matching, or large-scale Data Processing. Candidates without explicit, relevant experience should not apply.

We are seeking a highly specialized Data Engineer to collaborate on a mission-critical data matching pipeline. As the lead Python and JS developer who built the initial system, I will provide full support and code context.

The Core Challenge

The goal is to link two large, existing datasets (≈1.5 million total records) that reside on the same MongoDB Atlas cluster. Crucially, in both databases the hierarchical relationships (Artist → Albums → Tracks) are already complete and verified with unique IDs. The core task is to create a complete, duplicate-free mapping by linking the unique IDs across the two datasets despite inconsistent naming conventions, which calls for robust fuzzy matching.

Required Methodology

Hierarchical Cascade: Link Artist IDs first, then use each confirmed link to cascade matching efficiently and accurately to the corresponding Album and Track IDs (see the sketches at the end of this post).

Scale & Performance: Handle high-volume string comparisons with efficient blocking strategies and parallel processing to meet our performance targets.

Entity Creation: Identify external entities that are not in our database and create new, clean internal records for them in the same MongoDB cluster.

Technical Stack (Must Have)

Engine: Production-ready Python code (for speed and data manipulation).
Database: Optimized reads and writes against MongoDB Atlas (single cluster).
Deployment: Deliverable must be a Docker container ready for AWS deployment with CI/CD integration.

Acceptance Criteria (Non-Negotiable)

The container must process a 100k-record sample in under 60 minutes on an m5.large instance and achieve ≥95% precision and ≥90% recall at the track level.
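
To make the expected approach concrete, here is a minimal sketch of the blocked fuzzy-matching step at the artist level. It is illustrative only: the library choices (pymongo, rapidfuzz), the database/collection names (catalog_a.artists, catalog_b.artists), the field names, the blocking key, and the threshold are my assumptions, not fixed requirements, and your solution may structure this differently.

```python
"""Sketch: blocked fuzzy matching of artists across two datasets.

Assumptions (illustrative, not requirements): both datasets live on one
MongoDB Atlas cluster as 'catalog_a.artists' and 'catalog_b.artists',
each document has '_id' and 'name', and a 3-character prefix block plus
a token_sort_ratio cutoff of 90 is good enough for a first pass.
"""
from collections import defaultdict

from pymongo import MongoClient
from rapidfuzz import fuzz

THRESHOLD = 90  # hypothetical similarity cutoff (0-100)


def blocking_key(name: str) -> str:
    """Cheap grouping key so we never do an all-pairs comparison."""
    normalized = "".join(ch for ch in name.lower() if ch.isalnum())
    return normalized[:3]


def match_artists(uri: str) -> dict:
    """Return a mapping {artist_id_a: artist_id_b} for confident matches."""
    client = MongoClient(uri)
    side_a = client["catalog_a"]["artists"]
    side_b = client["catalog_b"]["artists"]

    # Build blocks from dataset B once, then probe with dataset A.
    blocks = defaultdict(list)
    for doc in side_b.find({}, {"_id": 1, "name": 1}):
        blocks[blocking_key(doc["name"])].append(doc)

    mapping = {}
    for doc in side_a.find({}, {"_id": 1, "name": 1}):
        candidates = blocks.get(blocking_key(doc["name"]), [])
        best_id, best_score = None, 0
        for cand in candidates:
            score = fuzz.token_sort_ratio(doc["name"], cand["name"])
            if score > best_score:
                best_id, best_score = cand["_id"], score
        if best_score >= THRESHOLD:
            mapping[doc["_id"]] = best_id
    return mapping
```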
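
And a second sketch showing how a confirmed artist link cascades to albums, and where new internal records get created when no counterpart exists. Again, the collection names ('catalog_a.albums', 'catalog_b.albums', 'id_map'), fields, and the cutoff are hypothetical placeholders I am using for illustration.

```python
"""Sketch: cascade album matching under an already-confirmed artist link,
creating clean internal records for albums with no counterpart.

Assumptions (illustrative only): album documents carry 'artist_id' and
'title', and cross-dataset links are written to an 'id_map' collection.
"""
from pymongo import MongoClient
from rapidfuzz import fuzz, process

ALBUM_THRESHOLD = 88  # hypothetical cutoff


def cascade_albums(uri: str, artist_mapping: dict) -> None:
    client = MongoClient(uri)
    albums_a = client["catalog_a"]["albums"]
    albums_b = client["catalog_b"]["albums"]
    id_map = client["catalog_a"]["id_map"]

    for artist_a, artist_b in artist_mapping.items():
        # Candidates are limited to the linked artist pair, so the fuzzy
        # comparison stays small no matter how large the datasets are.
        titles = {
            doc["_id"]: doc["title"]
            for doc in albums_b.find({"artist_id": artist_b}, {"_id": 1, "title": 1})
        }

        for doc in albums_a.find({"artist_id": artist_a}, {"_id": 1, "title": 1}):
            match = process.extractOne(
                doc["title"], titles,
                scorer=fuzz.token_sort_ratio,
                score_cutoff=ALBUM_THRESHOLD,
            )
            if match:
                _, _, matched_id = match  # (value, score, key) for dict choices
                id_map.insert_one(
                    {"external_id": doc["_id"], "internal_id": matched_id, "level": "album"}
                )
            else:
                # No counterpart: create a clean internal record and link it.
                new_id = albums_b.insert_one(
                    {"artist_id": artist_b, "title": doc["title"]}
                ).inserted_id
                id_map.insert_one(
                    {"external_id": doc["_id"], "internal_id": new_id, "level": "album"}
                )
```

Because each confirmed artist pair is independent, the outer loop parallelizes naturally (for example with a process pool or a job queue), which is where I would expect the 100k-in-60-minutes target to be met; the same pattern then repeats one level down for tracks.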