Playwright Public Records ETL Automation

Client: AI | Published: 08.12.2025
Budget: $60

I’m standing up a production-grade ETL pipeline that visits a public-records website with Playwright (Python), extracts the legally public data every hour, cleans and normalizes it, then loads the results into Postgres on Supabase. Long-term maintainability and horizontal scalability are the primary goals, so the codebase should be modular, clearly documented, and ready for future contributors to extend without fear of breaking things.

Core build expectations

• Browser automation: headless Playwright with smart pacing, built-in retry logic, and respect for site rate limits (see the pacing/retry sketch below).
• Transformation layer: standardization, normalization, plus upfront cleansing and validation before anything ever touches the database (see the validation sketch below).
• Storage: well-designed Postgres schema on Supabase, complete with upsert logic, indexes, and migrations (see the upsert sketch below).
• Packaging & deploy: a Docker image that ships to Cloud Run through CI/CD (GitHub Actions or Cloud Build), including environment-specific configs, secrets management, and unit/integration tests.
• Observability: structured JSON logs, centralized error tracking, and a lightweight dashboard (Cloud Monitoring or Grafana) that shows job success counts, latency, and row-insert metrics (see the logging sketch below).

Acceptance criteria

1. An hourly Cloud Run invocation completes end-to-end with no manual intervention (see the entrypoint sketch below).
2. Data arrives in Postgres fully cleaned and normalized, matching a sample spec I’ll provide.
3. Logs, metrics, and alerts are viewable in the chosen monitoring stack.
4. The repository contains clear README instructions, environment templates, and a one-command local dev setup (`docker compose up`).

If you’re comfortable taking a project from scraping logic all the way to a cloud-native, self-healing service, let’s talk and get this pipeline running.
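To ground the expectations above, a few minimal sketches follow; every URL, selector, field name, and table shape in them is an assumption standing in for the sample spec. First, the pacing-and-retry pattern for the browser layer: jittered delays between page loads and exponential backoff on timeouts. `BASE_URL`, `ROW_SELECTOR`, and the `?page=` pagination scheme are placeholders for whatever the target site actually uses.

```python
import random
import time

from playwright.sync_api import TimeoutError as PlaywrightTimeout, sync_playwright

# Hypothetical URL and selector -- the real ones depend on the target site.
BASE_URL = "https://example.gov/records"
ROW_SELECTOR = "table.records tbody tr"


def scrape_page(page, page_number: int, max_retries: int = 3) -> list[str]:
    """Load one listing page, retrying with exponential backoff on timeouts."""
    for attempt in range(1, max_retries + 1):
        try:
            page.goto(f"{BASE_URL}?page={page_number}", timeout=30_000)
            page.wait_for_selector(ROW_SELECTOR, timeout=15_000)
            return [row.inner_text() for row in page.query_selector_all(ROW_SELECTOR)]
        except PlaywrightTimeout:
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt)  # back off: 2s, 4s, 8s
    return []  # unreachable; keeps type checkers happy


def crawl(num_pages: int) -> list[str]:
    """Walk the listing pages with polite, jittered pacing between requests."""
    rows: list[str] = []
    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        page = browser.new_page()
        for n in range(1, num_pages + 1):
            rows.extend(scrape_page(page, n))
            time.sleep(random.uniform(2.0, 5.0))  # respect site rate limits
        browser.close()
    return rows
```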
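For the transformation layer, a rough sketch of row-level cleansing and validation: rows are rejected before anything touches the database. The `Record` fields and the `%m/%d/%Y` date format are assumptions until the sample spec is shared.

```python
from dataclasses import dataclass
from datetime import date, datetime


# Hypothetical record shape; the real fields come from the sample spec.
@dataclass(frozen=True)
class Record:
    case_id: str
    party_name: str
    filed_on: date


def normalize(raw: dict[str, str]) -> Record | None:
    """Cleanse and validate one scraped row; return None to reject it."""
    case_id = raw.get("case_id", "").strip().upper()
    party_name = " ".join(raw.get("party_name", "").split()).title()  # collapse whitespace
    try:
        filed_on = datetime.strptime(raw.get("filed_on", ""), "%m/%d/%Y").date()
    except ValueError:
        return None  # unparseable date: reject before it reaches the database
    if not case_id or not party_name:
        return None  # required fields missing
    return Record(case_id, party_name, filed_on)
```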
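For storage, a sketch of idempotent loading via a bulk upsert keyed on a natural identifier, so reruns of the hourly job never duplicate rows. The `records` table, the `case_id` conflict target, and psycopg2 as the driver are all assumptions; the real schema ships with the migrations.

```python
import os
from datetime import date

import psycopg2
from psycopg2.extras import execute_values

# Connection string from the environment (Supabase provides one per project).
DSN = os.environ["DATABASE_URL"]

UPSERT_SQL = """
    INSERT INTO records (case_id, party_name, filed_on)
    VALUES %s
    ON CONFLICT (case_id) DO UPDATE
        SET party_name = EXCLUDED.party_name,
            filed_on   = EXCLUDED.filed_on
"""


def load(rows: list[tuple[str, str, date]]) -> int:
    """Bulk-upsert cleaned rows; safe to re-run on the same data."""
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        execute_values(cur, UPSERT_SQL, rows)
        return cur.rowcount  # note: reflects the last batch execute_values ran
```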
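For observability, a minimal structured-logging sketch: one JSON object per line on stdout, which Cloud Run forwards to Cloud Logging where the fields become queryable. The `rows_inserted` and `duration_ms` fields (and their values) are illustrative.

```python
import json
import logging
import sys
import time

EXTRA_FIELDS = ("rows_inserted", "duration_ms")  # metrics passed via `extra=`


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so the log backend can index fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "severity": record.levelname,
            "message": record.getMessage(),
            # `extra=` kwargs land on the record's __dict__; pick out known ones.
            **{k: v for k, v in record.__dict__.items() if k in EXTRA_FIELDS},
        }
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

log = logging.getLogger("etl")
log.info("job finished", extra={"rows_inserted": 1234, "duration_ms": 5821})
```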
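Finally, for the hourly trigger, one common pattern (an assumption here, not a requirement) is Cloud Scheduler POSTing to a Cloud Run service endpoint. A skeletal entrypoint might look like this, with `run_pipeline` standing in for the extract-transform-load stages sketched above.

```python
import os

from flask import Flask, jsonify

app = Flask(__name__)


def run_pipeline() -> int:
    """Placeholder for the extract -> transform -> load stages sketched above."""
    return 0


@app.route("/run", methods=["POST"])
def run_job():
    # Cloud Scheduler hits this endpoint once an hour.
    inserted = run_pipeline()
    return jsonify({"rows_inserted": inserted}), 200


if __name__ == "__main__":
    # Cloud Run injects PORT; default to 8080 for local development.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```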