Senior Data Pipeline / ETL Engineer (FastAPI, PostgreSQL, OpenSearch) – Build MVP Data Ingestion Pipeline for Financial Intelligence Platform

⸻

Overview

We are building a financial intelligence infrastructure platform that aggregates global corporate registries, sanctions lists, and ownership data to produce compliance-grade investigative reports. The GitHub repository, architecture documentation, and core backend services already exist. What we need now is a senior data pipeline engineer who can complete the data ingestion and normalization pipeline so that the MVP workflow runs end to end.

The primary MVP workflow is: search for a person or company → resolve the entity → check sanctions exposure → reconstruct ownership relationships → produce an evidence-backed report. Your role is to build the ingestion pipeline that powers this workflow.

This is not a greenfield project. You will be working inside an existing architecture and repository.

⸻

What You Will Build

You will implement the ETL / data pipeline layer that ingests and prepares structured data for the RealScore screening workflow.

The pipeline must support ingestion of:
• Global sanctions lists
  • OFAC SDN
  • EU Consolidated
  • UN Sanctions
• Corporate registry data
• Beneficial ownership data
• Structured entity datasets

The system must perform:

1. Data Ingestion

Automated ingestion jobs that pull structured datasets and load them into PostgreSQL.

Requirements:
• Idempotent ingestion
• SHA-256 checksum tracking
• Version tracking for data updates
• Retry mechanisms for failed jobs

⸻

2. Data Normalization

Convert raw records into a standardized entity schema. Examples:
• normalize company names
• remove legal suffixes
• standardize jurisdictions
• normalize identifiers

The output should populate:
• entities
• identifiers
• relationships
• evidence_records

⸻
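To make the idempotency and checksum requirements concrete, here is a minimal sketch of the intended behavior. `IngestTracker` and its fields are hypothetical stand-ins for the ingest-run table in PostgreSQL; the real job would persist digests there rather than in memory.

```python
import hashlib
from dataclasses import dataclass, field


def file_checksum(data: bytes) -> str:
    """SHA-256 digest used to detect whether a source snapshot changed."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class IngestTracker:
    """In-memory stand-in for a per-source ingest-run table."""
    seen: dict = field(default_factory=dict)  # source name -> last checksum

    def should_ingest(self, source: str, data: bytes) -> bool:
        digest = file_checksum(data)
        if self.seen.get(source) == digest:
            # Unchanged snapshot: skip the load, keeping the job idempotent.
            return False
        # New or changed snapshot: record the digest (a new version) and load it.
        self.seen[source] = digest
        return True
```

Re-running a job against an unchanged snapshot becomes a no-op, while a changed digest triggers a fresh, versioned load.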
3. Entity Resolution

Implement a multi-pass entity resolution system:

Pass 1 — Deterministic
• exact identifier matches (LEI, registry IDs)

Pass 2 — Semi-deterministic
• normalized name matches

Pass 3 — Probabilistic
• fuzzy matching via OpenSearch
• Jaro / similarity scoring

Goal: resolve duplicate records across datasets.

⸻

4. Relationship Construction

Build ownership and control relationships. Examples:
• company → director
• company → shareholder
• entity → sanctioned entity
• entity → related entities

Relationships must be stored so they can be used by the risk engine.

⸻

5. Pipeline Orchestration

The pipeline must support:
• scheduled ingestion jobs
• dependency ordering
• failure recovery
• logging

Suggested tools (already in the repo):
• Python
• FastAPI
• PostgreSQL
• Celery / Redis
• OpenSearch

⸻

Expected Output

When a user searches for a person or company, the system must be able to:
1. Resolve the entity
2. Check sanctions exposure
3. Trace ownership relationships
4. Generate a structured evidence report

This pipeline is the core engine powering that workflow.

⸻

Existing Stack

You will work inside an existing repository that includes:
• a FastAPI backend
• a PostgreSQL database
• an ingestion service skeleton
• normalization schemas
• an entity service
• architecture documentation in the GitHub repo

You will extend the current pipeline, not rebuild it from scratch.
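The three resolution passes above can be sketched in miniature. This is an illustration only: `normalize_name`, `resolve`, the suffix list, and the 0.85 threshold are hypothetical, and `difflib` stands in for the OpenSearch fuzzy / Jaro scoring the production Pass 3 would use.

```python
from difflib import SequenceMatcher

# Hypothetical suffix list; the real normalization schema defines the full set.
LEGAL_SUFFIXES = {"ltd", "llc", "inc", "gmbh", "sa", "plc", "corp"}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    tokens = [t.strip(".,") for t in name.lower().split()]
    return " ".join(t for t in tokens if t and t not in LEGAL_SUFFIXES)


def resolve(candidate: dict, records: list, threshold: float = 0.85):
    """Three-pass resolution: identifiers, normalized names, then fuzzy."""
    # Pass 1 — deterministic: exact identifier match (e.g. LEI).
    for r in records:
        if candidate.get("lei") and candidate["lei"] == r.get("lei"):
            return r, "deterministic"
    # Pass 2 — semi-deterministic: exact match on normalized names.
    cand_norm = normalize_name(candidate["name"])
    for r in records:
        if normalize_name(r["name"]) == cand_norm:
            return r, "semi-deterministic"
    # Pass 3 — probabilistic: similarity scoring (OpenSearch in production).
    best, score = None, 0.0
    for r in records:
        s = SequenceMatcher(None, cand_norm, normalize_name(r["name"])).ratio()
        if s > score:
            best, score = r, s
    if score >= threshold:
        return best, "probabilistic"
    return None, "unresolved"
```

With this shape, "Acme Holdings Ltd." and "ACME Holdings LLC" collapse in Pass 2 after suffix stripping, while a misspelled "Acme Holdngs" still resolves in Pass 3.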
⸻

Required Experience

Minimum requirements:
• 5+ years building production ETL pipelines
• Python data engineering experience
• PostgreSQL data modeling
• search engines (OpenSearch or Elasticsearch)
• experience with large structured datasets

Preferred experience:
• sanctions / compliance data
• entity resolution systems
• corporate registry datasets
• financial intelligence or AML systems

⸻

Deliverables

You will deliver:
• fully functional ingestion pipelines
• normalized entity datasets
• an entity resolution implementation
• pipeline orchestration
• documentation inside the repo

The pipeline must run via:

make seed
make pipeline

and populate the database correctly.

⸻

Important

This is a serious engineering role, not a simple scripting task. We are looking for someone who can think like a data systems architect, not just write quick ETL scripts.

Please include in your proposal:
1. Examples of ETL pipelines you have built
2. Your experience with entity resolution
3. Your experience with large-scale data ingestion systems
4. Your GitHub profile

⸻

Budget

Open to a fixed price or a milestone structure depending on experience. We prioritize quality and reliability over the lowest bid.

⸻

If you have experience building serious data pipelines and investigative data systems, we would like to hear from you. All proposals must include GitHub links showing your past work specifically with pipelines.