Hierarchical Music Entity Resolver (Python/MongoDB/AWS)

Budget: $750

We are seeking a seasoned Entity Resolution (ER) Specialist to collaborate on building a high-performance matching pipeline for our music metadata. As the lead Python and JS developer who built the initial system, I will provide full support and infrastructure context.

Project Scope & Core Challenge

The system must link records between two existing databases (our internal MongoDB Atlas cluster and an external source like Genius). Crucially, in both databases, the hierarchical relationships (Artist → Albums → Tracks) are already established, and every entity has a unique ID. The task is to build a process that leverages this existing structure:

Phase 1 (Core ER): Accurately link Artist IDs between the two databases.
Phase 2 (Cascade): Use the confirmed Artist links to efficiently and accurately cascade the matching process to the corresponding Album IDs and Track IDs.
Scale: The pipeline must efficiently handle ≈1.5 million tracks and identify/create new internal entities for millions of external records not yet in our database.
Database: Read and write operations must be optimized for MongoDB Atlas.

Technical Methodology & Collaboration

The matching core must be robust and high-speed:

ER Algorithms: The core should leverage the Record Linkage framework, supplemented by powerful string similarity techniques like RapidFuzz or equivalent methods for optimal precision.
Performance: I expect highly efficient blocking/indexing to ensure speed.
Modularity: The logic must be clean and modular, allowing us (as developers) to easily tune weights, thresholds, and candidate generators in the future.

Infrastructure & Acceptance Criteria

The solution must be production-ready for our existing environment.

Deployment Stack: Deliverables must be encapsulated in a Docker container ready for deployment to our AWS environment and integrated neatly into our existing CI/CD flow.
Acceptance Criteria: A job is complete when the container processes a provided sample of 100k records in under 60 minutes on an m5.large, and returns ≥95% precision and ≥90% recall at the track level.

Deliverables

Production-ready Python project (PEP 8 compliant).
Dockerfile and compose file that build and run locally and in AWS.
Link Map collection written back to MongoDB with confidence scores.
Brief validation report summarising precision/recall on a held-out set.

Timeline: We are ready to move immediately. Please respond only if you can begin right away and meet the ASAP delivery timeline.
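
For orientation, the two-phase flow described above (link artists within blocks, then cascade to albums and tracks only under linked artists) might look roughly like the sketch below. It is an illustration under stated assumptions, not a prescribed design: the record shapes (`id`, `name`, `title`), the three-character blocking key, the 0.85 threshold, and all function names are hypothetical, and the standard-library `difflib.SequenceMatcher` stands in for a RapidFuzz scorer so that the example has no third-party dependencies.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", name.lower()).split())


def similarity(a: str, b: str) -> float:
    """Stand-in scorer in [0, 1]; a RapidFuzz scorer could be swapped in here."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def block_key(name: str) -> str:
    """Hypothetical blocking key: first three characters of the normalized name."""
    return normalize(name)[:3]


def best_match(text, candidates, field, threshold):
    """Return (external_id, score) for the best candidate at or above threshold, else None."""
    scored = [(similarity(text, c[field]), c["id"]) for c in candidates]
    if not scored:
        return None
    score, ext_id = max(scored)
    return (ext_id, score) if score >= threshold else None


def link_artists(internal, external, threshold=0.85):
    """Phase 1: compare artists only within the same block."""
    blocks = defaultdict(list)
    for artist in external:
        blocks[block_key(artist["name"])].append(artist)
    links = {}
    for artist in internal:
        match = best_match(artist["name"], blocks.get(block_key(artist["name"]), []),
                           "name", threshold)
        if match:
            links[artist["id"]] = match
    return links


def cascade(parent_links, internal_children, external_children, threshold=0.85):
    """Phase 2: match albums (or tracks) only under already-linked parents."""
    links = {}
    for parent_id, (ext_parent_id, _score) in parent_links.items():
        candidates = external_children.get(ext_parent_id, [])
        for item in internal_children.get(parent_id, []):
            match = best_match(item["title"], candidates, "title", threshold)
            if match:
                links[item["id"]] = match
    return links


def precision_recall(predicted, truth):
    """Precision and recall of predicted {internal_id: external_id} against labelled truth."""
    tp = sum(1 for k, v in predicted.items() if truth.get(k) == v)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall


# Toy data, invented purely for illustration.
internal_artists = [{"id": "a1", "name": "The Beatles"}, {"id": "a2", "name": "Daft Punk"}]
external_artists = [{"id": "g10", "name": "the beatles"},
                    {"id": "g11", "name": "Daft Punk."},
                    {"id": "g12", "name": "Daft Pink Floyd"}]
internal_albums = {"a1": [{"id": "al1", "title": "Abbey Road"}]}
external_albums = {"g10": [{"id": "ga1", "title": "Abbey Road."},
                           {"id": "ga2", "title": "Let It Be"}]}

artist_links = link_artists(internal_artists, external_artists)
album_links = cascade(artist_links, internal_albums, external_albums)
scores = precision_recall({k: v[0] for k, v in artist_links.items()},
                          {"a1": "g10", "a2": "g11"})
```

Because candidate generation, scoring, and thresholds live in separate functions, each can be tuned or replaced independently, which is the kind of modularity asked for above. The same `cascade` function can be reused for tracks by passing album links as the parent links.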

Python
