US Email Scraping Automation

Budget: $250

I need a robust, repeatable scraping pipeline that will harvest 100,000+ U.S. contacts (first name, last name, email, source URL, and a confidence label), with Facebook and Twitter as the primary sources. Scrapy and Python will drive the workflow, with Selenium reserved for any dynamically loaded pages. Please architect solid proxy / IP rotation, error handling, and deduplication so the final dataset is clean and unique.

Source URL is the single most critical accuracy field, so every record must include it. Confidence level and collection date are still expected in the file, but a missing Source URL invalidates the row.

The script or CLI you deliver will run on demand rather than on a fixed schedule, so I can trigger incremental refreshes whenever I choose.

Deliverables
• Deduplicated CSV (≥ 100k rows) meeting the field specs above
• Well-commented Scrapy/Selenium code with clear instructions for local or cloud VM execution
• Command-line interface or script I can rerun for incremental updates, fully documented
• Brief technical report covering sources, crawl logic, rate limiting, proxy strategy, error handling, and QA checks

Acceptance criteria
• All contacts must have U.S. location evidence
• ≤ 1% duplicate rate (measured by email)
• Average confidence level of medium or higher, with rationale in the report

If this matches your expertise in Scrapy, Python, Selenium, and large-scale data collection, I'm ready to review your approach and timeline.
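For reference, the Source URL rule and the duplicate-rate criterion could be checked on the delivered CSV roughly as follows. This is only a minimal sketch: the column names `email` and `source_url`, the helper names `audit_rows` and `audit_csv`, and the choice to compare emails case-insensitively are all assumptions for illustration, not part of the spec.

```python
import csv


def normalize_email(email):
    # Assumption: emails are compared case-insensitively, ignoring outer whitespace.
    return email.strip().lower()


def audit_rows(rows):
    """Count invalid rows, unique emails, and duplicates in an iterable of dicts.

    A row is treated as invalid when its source_url or email is empty.
    The duplicate rate is duplicates / (rows that passed validation).
    """
    seen = set()
    invalid = 0
    duplicates = 0
    for row in rows:
        source_url = (row.get("source_url") or "").strip()
        email = normalize_email(row.get("email") or "")
        if not source_url or not email:
            invalid += 1  # missing Source URL (or email) invalidates the row
            continue
        if email in seen:
            duplicates += 1
        else:
            seen.add(email)
    checked = len(seen) + duplicates
    rate = duplicates / checked if checked else 0.0
    return {
        "invalid": invalid,
        "unique": len(seen),
        "duplicates": duplicates,
        "duplicate_rate": rate,
    }


def audit_csv(path):
    # Reads a CSV with a header row; the column names are assumed as noted above.
    with open(path, newline="", encoding="utf-8") as handle:
        return audit_rows(csv.DictReader(handle))
```

A delivery would meet the duplicate criterion when `audit_csv("contacts.csv")["duplicate_rate"] <= 0.01`, where `contacts.csv` is a placeholder file name.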

Python
