4 Async Web Scrapers (Auto Parts) - Python/Playwright/BrightData

Client: AI | Published: 29.01.2026
Budget: $3,000

Budget: $1,800 USD (Fixed Price)
Timeline: 7-10 days
Tech Stack: Python 3.10+, Playwright (Async), SQLite, Ubuntu VPS
Skills Required: Python, Playwright, Web Scraping, Async/Await, Proxies, SQLite

═══════════════════════════════════════════════════════
PROJECT DESCRIPTION
═══════════════════════════════════════════════════════

I need 4 production-ready, high-performance web scrapers for auto parts websites and auction history. The goal is to build a robust data pipeline that runs autonomously on an Ubuntu VPS using Bright Data residential proxies.

WEBSITES TO SCRAPE:
1. https://orders.partsmax.com/ (Wholesale parts - Login required)
2. https://primeroautoparts.com/ (Wholesale parts - Login required)
3. en.bidfax.info (Auction history - Public)
4. rockauto.com (Retail catalog - Public - Complex navigation)

═══════════════════════════════════════════════════════
TECHNICAL REQUIREMENTS (Non-Negotiable)
═══════════════════════════════════════════════════════

1. ASYNC/AWAIT ARCHITECTURE
   - Must use Python asyncio + Playwright
   - NO Selenium allowed
   - Clean, maintainable async code

2. CONCURRENCY
   - Handle 10-30 concurrent browser contexts efficiently
   - Proper resource management (no memory leaks)
   - Configurable concurrency limits

3. BANDWIDTH OPTIMIZATION (CRITICAL)
   - Block images, fonts, CSS, videos, and media files using route.abort()
   - Target: under 300KB per page load (vs 2-5MB unoptimized)
   - This directly reduces Bright Data proxy costs by ~80%
   - Log bandwidth usage per 100 pages for verification

4. DATA INTEGRITY (CRITICAL)
   - NO direct CSV writing during scraping
   - Must save to a local SQLite database first (prevents data loss on crash)
   - Database structure:
     * Each scraper has its own database: partsmax.db, primeroautoparts.db, bidfax.db, rockauto.db
     * Tables include: id (autoincrement), scraped_at (timestamp), all data fields
     * Checkpoint table: tracks progress (last_vehicle_id, last_page, etc.)
   - Separate export script: python export.py converts SQLite to CSV on demand

5. RESILIENCE
   - Checkpoint system: if the script stops at record 5,000 of 10,000, it must resume exactly there
   - Retry logic: auto-retry on 500/403 errors or timeouts
   - Graceful shutdown: SIGTERM should save a checkpoint before exit

6. INFRASTRUCTURE
   - Must run headless on an Ubuntu 24.04 VPS
   - Bright Data proxy integration (credentials provided)
   - Configurable via .env file
   - Production-ready error logging

═══════════════════════════════════════════════════════
SPECIFIC CHALLENGES & REQUIRED SOLUTIONS
═══════════════════════════════════════════════════════

1. ROCKAUTO.COM (The Beast)
   Challenge:
   - Complex tree navigation system
   - "Soft blocks" - infinite loading bars, empty results after N requests
   - Aggressive bot detection
   Required Solution:
   - Implement browser fingerprinting stealth techniques (playwright-stealth or similar)
   - Handle dynamic category tree expansion efficiently
   - Detect and handle soft blocks (wait, retry with a new session)
   - Must NOT open thousands of tabs (memory explosion)

2. PARTSMAX / PRIMEROAUTOPARTS
   Challenge:
   - Pricing requires hover interactions or variant selection
   - Distinguish "List Price" vs "Your Price" (member pricing)
   Required Solution:
   - Accurate hover-based data extraction
   - Handle missing prices gracefully
   - Extract stock availability

3. BIDFAX
   Challenge:
   - High-volume pagination (~500,000 records)
   - Potential rate limiting
   Required Solution:
   - Efficient pagination without timeouts
   - Checkpoint every N pages
   - Handle network interruptions gracefully

═══════════════════════════════════════════════════════
DELIVERABLES (Per Scraper)
═══════════════════════════════════════════════════════

For EACH of the 4 scrapers:
1. scraper.py - Main asynchronous scraping script
2. models.py - SQLite schema definitions
3. config.py - Centralized configuration (concurrency, timeouts, retries)
4. export.py - SQLite to CSV converter with filters (date range, vehicle type, etc.)
5. checkpoint_viewer.py - Quick script to check current scraping progress
6. requirements.txt - All Python dependencies
7. README.md - Step-by-step setup guide specific to this scraper

PLUS Global Deliverables:
8. setup.sh - Bash script to install Python/Playwright/dependencies on a fresh Ubuntu VPS
9. global_config.env.example - Template for proxy credentials and global settings
10. Video Walkthrough - 15-20 minute Loom/screen recording explaining:
    - Code architecture
    - How to deploy on the VPS
    - How to run each scraper
    - How to monitor progress
    - How to handle common errors
11. GitHub Repository - I will create a private GitHub repository and add you as a collaborator. You must push all code directly to the main or dev branch of my repository. This is a requirement for milestone payments.

═══════════════════════════════════════════════════════
DATA EXTRACTION REQUIREMENTS
═══════════════════════════════════════════════════════

PARTSMAX Output (SQLite, then CSV):
year, make, model, description, part_number, your_price, stock, list_price, scraped_at

PRIMEROAUTOPARTS Output:
Same structure as PartsMax, plus image_url.
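To make the intended models.py layout concrete, here is a minimal sketch for partsmax.db. The table and helper names are illustrative assumptions drawn from the field list above, not a final schema:

```python
import sqlite3

# Illustrative schema for partsmax.db: one data table plus a checkpoint
# table for resume support. Column names mirror the required field list;
# everything else (table names, types) is an assumption, not a spec.
SCHEMA = """
CREATE TABLE IF NOT EXISTS parts (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    year        INTEGER,
    make        TEXT,
    model       TEXT,
    description TEXT,
    part_number TEXT,
    your_price  REAL,
    stock       TEXT,
    list_price  REAL,
    scraped_at  TEXT DEFAULT (datetime('now'))
);
CREATE TABLE IF NOT EXISTS checkpoint (
    key   TEXT PRIMARY KEY,   -- e.g. 'last_vehicle_id', 'last_page'
    value TEXT
);
"""

def init_db(path: str = "partsmax.db") -> sqlite3.Connection:
    """Open (or create) the scraper's database and ensure tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

The other three scrapers would follow the same pattern with their own field lists.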
BIDFAX Output:
final_bid, auction, lot_number, sale_date, sale_location, vin, make, model, year, documents_title, seller, primary_damage, secondary_damage, odometer, condition, estimated_retail_value, transmission, keys, fuel, drive, scraped_at

ROCKAUTO Output:
year, make, model, engine, category, part_type, manufacturer, part_number, price, scraped_at

═══════════════════════════════════════════════════════
PAYMENT MILESTONES
═══════════════════════════════════════════════════════

Milestone 1 (40% - $720) - Day 3-4:
- PartsMax + PrimeroAutoParts completed and tested
- Both tested with 1,000+ records each
- SQLite implementation verified
- Bandwidth optimization confirmed (route.abort working)
- Code in the GitHub repository

Milestone 2 (40% - $720) - Day 7-8:
- BidFax + RockAuto completed and tested
- RockAuto soft-block handling verified with 500+ requests
- All 4 scrapers deployed and running on the VPS
- Checkpoint/resume tested (kill the process, restart, verify continuation)

Milestone 3 (20% - $360) - Day 10:
- All documentation complete
- Video walkthrough delivered
- Final stress test passed (run all 4 scrapers for 2+ hours)
- 7-day support period begins

═══════════════════════════════════════════════════════
WHAT I PROVIDE
═══════════════════════════════════════════════════════

- Dedicated VPS: a clean Ubuntu 24.04 LTS VPS (DigitalOcean) with 4GB RAM / 2 CPUs. Access granted via your SSH public key.
- Bright Data residential proxy credentials (high-quality proxies)
- Login credentials for PartsMax and PrimeroAutoParts
- Sample vehicle combination list (CSV with Year/Make/Model)
- Quick response time for questions (I'm technical, no hand-holding needed)
- Existing reference code (Selenium-based, available as context)

═══════════════════════════════════════════════════════
SELECTION CRITERIA
═══════════════════════════════════════════════════════

You MUST have:
- A portfolio with 5+ web scrapers (Playwright strongly preferred)
- Experience with async Python and concurrent programming
- Proxy integration experience (Bright Data, Oxylabs, or similar)
- 90%+ job success rate on Freelancer
- Ability to start within 24 hours

I will immediately reject bids that:
- Don't answer ALL screening questions below
- Propose using Selenium instead of Playwright
- Have no portfolio, or give generic "I can do this" responses
- Bid under $1,200 (indicates you don't understand the complexity)
- Bid over $2,500 (overpriced for this scope)

═══════════════════════════════════════════════════════
SCREENING QUESTIONS (Must Answer All)
═══════════════════════════════════════════════════════

1. What specific Playwright method/approach will you use to block images and media to save proxy bandwidth? (Be specific - a code snippet is preferred.)
2. How do you handle RockAuto's "soft blocks" or infinite loading states? What is your detection and recovery strategy?
3. Have you scraped PartsMax, PrimeroAutoParts, or similar wholesale auto parts portals before? If yes, which ones?
4. Describe your exact approach for SQLite checkpoint/resume. What happens if the script crashes at record 5,432 of 10,000?
5. How many concurrent Playwright contexts can you safely run on a 4GB RAM VPS without memory issues?
6. Share a link to your best async web scraper (GitHub or portfolio). What was the scale (records scraped, pages/hour, etc.)?
7. Are you available to start immediately and deliver in 7-10 days?
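For calibration on question 1, this is roughly the pattern I have in mind. It is a sketch, not a prescription: the blocked-type sets and helper names are illustrative assumptions, and your answer may differ:

```python
# Bandwidth filter sketch: decide which requests to abort before they
# consume proxy bandwidth. The exact sets below are assumptions.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font", "stylesheet"}
BLOCKED_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg",
                      ".woff", ".woff2", ".ttf", ".mp4", ".css")

def should_block(resource_type: str, url: str) -> bool:
    """Return True if a request should be aborted to save bandwidth."""
    if resource_type in BLOCKED_RESOURCE_TYPES:
        return True
    # Fall back to extension matching (query string stripped first).
    return url.lower().split("?")[0].endswith(BLOCKED_EXTENSIONS)

async def install_blocking(page) -> None:
    """Wire the filter into a Playwright async page via request interception."""
    async def handler(route, request):
        if should_block(request.resource_type, request.url):
            await route.abort()
        else:
            await route.continue_()
    await page.route("**/*", handler)
```

Keeping the decision in a pure function (should_block) makes the bandwidth policy unit-testable without launching a browser.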
═══════════════════════════════════════════════════════
REQUIREMENTS & PROFILE
═══════════════════════════════════════════════════════

I am a technical founder (Licensed Dealer). This is Phase 1 of a larger infrastructure (8+ future scrapers planned).

MANDATORY: Async Playwright only (no Selenium). Must deploy on an Ubuntu VPS. Must deliver in 7-10 days.

═══════════════════════════════════════════════════════
TO APPLY (Must Include)
═══════════════════════════════════════════════════════

1. Answers to all 7 screening questions.
2. Links to GitHub repos showing large-scale async Playwright scrapers.
3. Confirmation of the 24h start and $1,800/10-day terms.
4. One suggested technical improvement to my approach.