Scraping developer with backend API and AI experience (Scraping / Go)

Client: AI | Published: 09.11.2025
Budget: $50

Mission
Build a resilient crawling and enrichment pipeline that ingests public web data at scale, enriches it with AI, stores it cleanly, and exposes it through stable APIs and a usable front end.

What you’ll do
• Design and ship site-specific and configurable crawlers that survive auth flows, pagination, anti-bot defenses, and layout changes.
• Choose the right tool per target: a headless browser with Playwright or Puppeteer, HTTP clients, or Scrapy-style spiders.
• Implement anti-blocking: rotating residential proxies, session and fingerprint management, CAPTCHA solving, backoff and retries.
• Extract full HTML and clean text, normalize and deduplicate, track versions, and persist to Postgres or Mongo with S3-style raw dumps.
• Build REST or GraphQL endpoints and lightweight front-end views to query, preview, diff, and QA scraped records.
• Add AI enrichment: language detection, summarization, keyword and entity tagging, classification, confidence scoring, and cost controls.
• Orchestrate runs on a schedule with queues and workers, add metrics, logs, and alerts, and document how to extend to new domains.
• Write tests that validate selectors, schemas, endpoints, and enrichment outputs end to end.

Required experience
• Strong in at least two of the following: Go, Node.js, Java. Comfortable context switching across back end, front end, and data plumbing.
• Production scraping at scale with Playwright or Puppeteer, plus HTML parsing with CSS selectors and XPath.
• Proven strategies for avoiding blocks: proxies, cookies and sessions, TLS fingerprinting, headless detection countermeasures, CAPTCHA services.
• API design with Express or Fastify, Spring Boot, or Go net/http and friends.
• Datastores: Postgres and Mongo, schema design, indexing for query patterns, Redis for queues and caching.
• Front end: React or Next.js, TypeScript, component-level data fetching and simple moderation tools.
• Containers and CI: Docker, GitHub Actions, basic Kubernetes or ECS, environment secrets, rollbacks.
• LLM or NLP integration: OpenAI or comparable APIs, prompt and schema design, latency and cost tradeoffs, JSON output guarantees.
• Testing culture: unit tests for extractors, contract tests for APIs, fixtures for DOM changes.

Nice to have
• Scrapy, Colly, or custom crawler frameworks.
• Orchestration with Temporal or Airflow, task queues with BullMQ, Celery, or Sidekiq-like tools.
• Search indexing with OpenSearch or Elasticsearch.
• Event pipelines with Kafka or SQS, stream processors.
• PII handling and compliance awareness, robots.txt and ToS risk assessment.

Success criteria
• Crawlers run on a schedule, extract defined fields, and self-heal with selector fallbacks and alerting.
• Data is deduplicated, versioned, and stored in query-friendly schemas with raw snapshots retained.
• Clean JSON is returned from a documented endpoint and consumed by the front end without munging.
• A concise runbook and a test suite prove two target sites work end to end, including enrichment.
• Cost per thousand pages, block rate, parse error rate, and time-to-fix are tracked and improving.

Deliverables
• Reusable crawler that can be re-pointed via a simple config file (a rough sketch of what such a config could drive follows this list).
• Integrated AI enrichment module that outputs structured JSON or CSV.
• Deployment instructions for a single-tenant server setup, plus sample runs on at least two sites.
• Minimal admin UI for monitoring jobs, reviewing records, and exporting data.
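To make the "simple config file" deliverable and the "selector fallbacks" success criterion concrete, here is a minimal sketch of a config-driven extractor in Go, assuming goquery for HTML parsing. The SiteConfig shape, field names, and example selectors are illustrative guesses, not part of the brief: each field lists CSS selectors tried in order, so a layout change degrades to a broader selector and a total miss is logged for alerting.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// SiteConfig is a hypothetical shape for the "simple config file":
// one entry per target field, with ordered fallback selectors.
type SiteConfig struct {
	Name   string              `json:"name"`
	URL    string              `json:"url"`
	Fields map[string][]string `json:"fields"` // field -> CSS selectors, tried in order
}

// extract walks the config and returns one flat record per page.
func extract(doc *goquery.Document, cfg SiteConfig) map[string]string {
	record := map[string]string{}
	for field, selectors := range cfg.Fields {
		for _, sel := range selectors {
			if node := doc.Find(sel).First(); node.Length() > 0 {
				record[field] = strings.TrimSpace(node.Text())
				break // first selector that matches wins
			}
		}
		if _, ok := record[field]; !ok {
			// A miss on every fallback is what should feed metrics and alerts.
			log.Printf("selector miss: site=%s field=%s", cfg.Name, field)
		}
	}
	return record
}

func main() {
	cfg := SiteConfig{
		Name: "example",
		URL:  "https://example.com/item/1",
		Fields: map[string][]string{
			"title": {"h1.product-title", "h1"},            // fall back to a broader selector
			"price": {"span.price", "[itemprop=price]"},
		},
	}
	resp, err := http.Get(cfg.URL)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	out, _ := json.Marshal(extract(doc, cfg))
	fmt.Println(string(out)) // clean JSON, ready for the enrichment step
}
```

Re-pointing the crawler at a new site then means shipping a new SiteConfig rather than new code, which is one way to read the deliverable above.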
Stack you might use here
Go or Node for the core workers, Playwright for headless browsing, Postgres and Mongo, Redis, S3-compatible object storage, Next.js for internal views, OpenAI or equivalent for enrichment, Docker and GitHub Actions, a queue like SQS or RabbitMQ, and an orchestrator like Temporal or a simple cron plus worker model (a bare-bones sketch of that model follows at the end of this post).

How to apply
Send links that prove it: repos, code samples, or brief write-ups of similar scrapers and pipelines you shipped in production. Include the hardest site you beat, what blocked you, and how you solved it. Share your preferred tooling and earliest start date.
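For reference, the "simple cron plus worker model" mentioned in the stack could look roughly like the stdlib-only Go sketch below. The Job shape, worker count, and tick interval are placeholders rather than requirements; a real setup would enqueue from cron or a scheduler and add retries, backoff, and metrics.

```go
package main

import (
	"log"
	"sync"
	"time"
)

// Job is a placeholder unit of work: one page or listing to crawl.
type Job struct {
	Site string
	URL  string
}

// worker drains the queue; a real worker would fetch, extract,
// enrich, and persist, with backoff on block or parse errors.
func worker(id int, jobs <-chan Job, wg *sync.WaitGroup) {
	defer wg.Done()
	for j := range jobs {
		log.Printf("worker %d: crawling %s (%s)", id, j.Site, j.URL)
		time.Sleep(200 * time.Millisecond) // simulate work
	}
}

func main() {
	jobs := make(chan Job, 100)
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ { // small fixed worker pool
		wg.Add(1)
		go worker(i, jobs, &wg)
	}

	ticker := time.NewTicker(2 * time.Second) // stand-in for the real crawl schedule
	defer ticker.Stop()
	for i := 0; i < 3; i++ { // run three scheduled passes, then stop
		<-ticker.C
		jobs <- Job{Site: "example", URL: "https://example.com/listing"}
	}
	close(jobs)
	wg.Wait()
}
```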