Auction Data Scraper & Database

Client: AI | Published: 21.03.2026
Budget: $250

I need a full-scale scraper that captures every layer of information offered by iaai.com and copart.com, then pipes it into a well-structured PostgreSQL database I can build analytics on.

Scope of data
The feed must span historical records, today’s listings (including real-time bid movement during live sales), and upcoming auctions. For each entry I expect the following fields to be filled: bid amounts, vehicle details, auction dates, image links, and auction status. Photo and video references should be stored as direct URLs so that front-end tools can load them instantly without additional processing.

Core requirements
• Continuous collection: the system should poll frequently enough to keep “current” lots accurate while an auction is in session.
• Back-fill: pull all historical data available on both sites so trend analysis starts on day one.
• Resilience: both platforms employ anti-bot measures, so the code must rotate IPs, manage cookies, and solve or bypass CAPTCHAs where legally permissible.
• Normalised schema: design the Postgres tables so historical, live, and upcoming lots coexist without duplication yet remain easy to query.
• Idempotent updates: reruns should update existing records rather than insert duplicates.
• Media handling: verify that every saved image/video URL is reachable.

Deliverables
1. PostgreSQL schema (DDL) plus a populated sample dump covering at least one full auction day from each site.
2. Scraper/ETL code with instructions to schedule it (cron, systemd, or Docker).
3. README documenting setup, environment variables, and expected runtime.
4. Quick validation script that returns the latest live bid for a given lot ID.

Acceptance criteria
• Running the ETL end-to-end on my server populates all specified fields with no missing columns.
• Querying the same lot twice a minute during a live auction shows changing bid amounts within 10 seconds of the website.
• All image/video links resolve with HTTP 200.
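The "idempotent updates" requirement is usually met with an upsert keyed on a natural lot identifier. A minimal sketch of that pattern, using Python's stdlib sqlite3 purely as a stand-in for PostgreSQL (the `ON CONFLICT ... DO UPDATE` syntax is the same in both); the table and column names (`lots`, `site`, `lot_id`, `current_bid`) are illustrative assumptions, not part of the brief:

```python
import sqlite3

# In-memory DB stands in for PostgreSQL; the UPSERT clause is identical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE lots (
        site        TEXT NOT NULL,   -- 'iaai' or 'copart' (assumed values)
        lot_id      TEXT NOT NULL,   -- site-native lot number
        current_bid REAL,
        status      TEXT,
        updated_at  TEXT,
        PRIMARY KEY (site, lot_id)   -- natural key blocks duplicates
    )
    """
)

def upsert_lot(conn, site, lot_id, bid, status, ts):
    """Insert a lot, or update it in place if it already exists."""
    conn.execute(
        """
        INSERT INTO lots (site, lot_id, current_bid, status, updated_at)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (site, lot_id) DO UPDATE SET
            current_bid = excluded.current_bid,
            status      = excluded.status,
            updated_at  = excluded.updated_at
        """,
        (site, lot_id, bid, status, ts),
    )

# Rerunning the same lot updates the row rather than inserting a duplicate.
upsert_lot(conn, "copart", "81234567", 500.0, "live", "2026-03-21T10:00:00Z")
upsert_lot(conn, "copart", "81234567", 750.0, "live", "2026-03-21T10:00:10Z")
rows = conn.execute("SELECT COUNT(*), MAX(current_bid) FROM lots").fetchone()
print(rows)  # -> (1, 750.0)
```

The composite `(site, lot_id)` key is what lets historical, live, and upcoming lots from both sites coexist in one table without collisions.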
The preferred stack is Python (Scrapy or Playwright), but I’m open to alternatives if you can meet the real-time requirement. In your bid, please outline your proposed approach, timeline, and any prior experience with high-volume, anti-scraping environments.
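For the quick validation script in deliverable 4, one workable shape is a single query over an append-only bid-history table, ordered by observation time. A hedged sketch under assumed names (`bids`, `amount`, `observed_at` are not specified in the brief), again with sqlite3 standing in for the Postgres driver:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE bids (
        site        TEXT NOT NULL,
        lot_id      TEXT NOT NULL,
        amount      REAL NOT NULL,
        observed_at TEXT NOT NULL   -- ISO-8601 scrape timestamp
    )
    """
)

def latest_bid(conn, site, lot_id):
    """Return the most recently observed bid for a lot, or None."""
    row = conn.execute(
        """
        SELECT amount FROM bids
        WHERE site = ? AND lot_id = ?
        ORDER BY observed_at DESC
        LIMIT 1
        """,
        (site, lot_id),
    ).fetchone()
    return row[0] if row else None

# Two scrape passes eight seconds apart; the later one wins.
conn.executemany(
    "INSERT INTO bids VALUES (?, ?, ?, ?)",
    [
        ("iaai", "34567890", 1200.0, "2026-03-21T10:00:00Z"),
        ("iaai", "34567890", 1350.0, "2026-03-21T10:00:08Z"),
    ],
)
print(latest_bid(conn, "iaai", "34567890"))  # -> 1350.0
```

Keeping every observation (rather than only the current bid) also gives the acceptance check "changing bid amounts within 10 seconds" something concrete to diff against.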