Automated Contact Info Scraper

Budget: $1500

We are building an automated system to collect contact information (name, email, phone number) for art teachers from U.S. K-12 school and district websites at scale.

Objective:

Given a large list of school or district website URLs (tens of thousands), the system should:

1. Crawl each site in a controlled, polite manner.
2. Discover relevant staff/contact pages (e.g., “Staff,” “Directory,” “Fine Arts,” “Art,” “Departments,” “Contact”).
3. Extract structured contact information, including:
   - Name
   - Role/Title
   - Email address
   - Phone number
4. Identify and prioritize art teachers or fine arts department contacts specifically.
5. Output results in a structured format (CSV / database-ready format).

Functional Requirements:

- Input: text file, CSV, or database containing school/district URLs.
- For each URL:
  - Fetch the homepage.
  - Discover and crawl relevant internal pages (limited depth, domain-restricted).
  - Extract emails (including mailto: links and visible emails).
  - Extract phone numbers.
  - Parse name/title associations where available.
- Role filtering: prioritize titles containing keywords such as “Art,” “Fine Arts,” “Visual Arts,” “Art Teacher,” etc. We only want to contact the art teacher, so the system must be able to target those positions directly.
- Fallback logic: if no individual art teacher is found, capture a department-level contact (e.g., fineartsATdistrict.k12.xx.us). This is lower priority, but would be a nice additional function.

Technical Requirements:

- Must handle thousands to tens of thousands of domains.
- Many email addresses may be obfuscated (for example, behind mailto: links or other browser-side techniques). The system should decode as many of these methods as possible.
- Many school sites use CMS platforms with inconsistent layouts.
- Must include:
  - Per-domain request throttling (polite crawling).
  - Proper request headers (browser-like User-Agent).
  - Retry logic with backoff.
  - Graceful handling of 401 / 403 / 429 / 503 responses.
- Should log:
  - Status codes
  - Errors
  - Lockouts or CAPTCHA responses
- Must be modular and extensible.
- Ideally built in Python (open to recommendations).

At a later stage we may need escalation to a headless browser (Playwright) for JavaScript-rendered staff directories, the ability to run distributed (cloud-ready), and optional proxy support if needed at scale.

We are currently looking to have an MVP built that will allow us to iterate later in-house.
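To illustrate the kind of email de-obfuscation we have in mind, here is a minimal sketch, not a specification. It assumes four common cases: percent-encoded mailto: links, HTML-entity-encoded addresses, spelled-out "[at]" / "[dot]" separators (plus the inline "AT" form used in the example above), and Cloudflare-style `data-cfemail` attributes. The function names (`decode_cfemail`, `extract_emails`) are hypothetical, and the patterns are heuristics that will miss some schemes and may produce occasional false positives.

```python
import html
import re
from urllib.parse import unquote

# Plain address pattern; deliberately simple, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
# "name [at] school [dot] org", "name(at)school.org", and similar.
AT_BRACKET_RE = re.compile(r"\s*[\[({]\s*at\s*[\])}]\s*", re.IGNORECASE)
DOT_BRACKET_RE = re.compile(r"\s*[\[({]\s*dot\s*[\])}]\s*", re.IGNORECASE)
# Heuristic for "fineartsATdistrict...": uppercase AT between lowercase characters.
AT_INLINE_RE = re.compile(r"(?<=[a-z0-9])AT(?=[a-z0-9])")
# Cloudflare email protection stores the address as hex in this attribute.
CFEMAIL_RE = re.compile(r'data-cfemail="([0-9a-fA-F]{4,})"')


def decode_cfemail(hex_str: str) -> str:
    """Decode a data-cfemail value: the first byte is an XOR key for the rest."""
    data = bytes.fromhex(hex_str)
    key = data[0]
    return bytes(b ^ key for b in data[1:]).decode("utf-8", errors="replace")


def extract_emails(page_html: str) -> list[str]:
    """Return unique, lowercased addresses found in raw HTML, in first-seen order."""
    decoded = []
    for hex_str in CFEMAIL_RE.findall(page_html):
        try:
            decoded.append(decode_cfemail(hex_str))
        except ValueError:
            continue  # odd-length or otherwise malformed hex; skip it

    # Undo entity encoding (&#64;) and percent-encoding (%40) before matching.
    text = unquote(html.unescape(page_html)) + " " + " ".join(decoded)
    text = AT_BRACKET_RE.sub("@", text)
    text = DOT_BRACKET_RE.sub(".", text)
    text = AT_INLINE_RE.sub("@", text)

    seen, result = set(), []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        if email not in seen:
            seen.add(email)
            result.append(email)
    return result


if __name__ == "__main__":
    sample = """
    <a href="mailto:jane.doe%40school.org">Jane Doe</a>
    <p>bob&#64;district&#46;k12.xx.us</p>
    <p>art [at] school [dot] org</p>
    <p>fineartsATdistrict.k12.xx.us</p>
    """
    for address in extract_emails(sample):
        print(address)
```

Addresses assembled by JavaScript at render time are out of scope for a sketch like this and would fall under the later headless-browser stage. A real implementation would also want to filter obvious non-addresses such as image filenames of the form `logo@2x.png`.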

Python
