Build Fully Automated German B2B Scraping & Data Cleaning System

Client: AI | Published: 31.10.2025

We need a Python expert to build an end-to-end automated web-scraping and data-processing system for German B2B directories. The goal is to eliminate all manual steps (scraping, cleaning, deduplication, enrichment, and export) within 15 days. You'll design and implement a high-speed, fault-tolerant pipeline that scrapes ~10 directories, normalizes data (including Umlauts and formatting), removes duplicates, detects updates, and exports clean, structured data automatically.

⸻

Core Responsibilities

• Develop an asynchronous scraping system (Python 3.11+, aiohttp/httpx/Scrapy/Playwright).
• Build deduplication and change-detection logic using hash comparison and timestamps.
• Design and connect a central database (PostgreSQL + SQLite) to store unique company records.
• Integrate proxy rotation and throttling (BrightData/Luminati or similar).
• Implement data normalization using ftfy, unidecode, python-phonenumbers, regex, and pandas.
• Crawl Impressum pages to auto-fill missing fields (phone, fax, website).
• Automate daily/weekly export to Excel/CSV using openpyxl.
• Add a basic monitoring dashboard (Streamlit) showing live progress, proxy health, and logs.
• Deliver well-structured, documented, production-ready code.

Rough, non-binding code sketches for several of these steps are included at the end of this post.

⸻

Required Skills

• Expert in Python web scraping (Scrapy / aiohttp / Playwright / asyncio)
• Strong knowledge of PostgreSQL / SQLite, schema design, and deduplication logic
• Experience with proxy management and rate limiting
• Skilled in data cleaning, parsing, and normalization
• Familiarity with incremental scraping / delta detection
• Solid understanding of data pipelines and automation
• Quick turnaround and the ability to deliver under short deadlines

⸻

Deliverables (Within 15 Days)

1. Working automated scraping pipeline for 10 German directories.
2. Deduplication + change-detection module (no re-scraping of unchanged data).
3. Proxy rotation with error handling and retry logic.
4. Normalization + cleanup system for all fields (company name, email, phone, address).
5. Data enrichment via Impressum pages (fax/phone/website).
6. Streamlit dashboard with live metrics and logs.
7. Automated Excel/CSV export of clean, unique data.
8. Deployment guide + short documentation.

⸻

Timeline

• Total duration: 15 days
• Milestones:
  • Days 1–3: Environment setup, database schema, proxy integration
  • Days 4–7: Scraping logic (3–4 directories)
  • Days 8–11: Deduplication + change detection + data normalization
  • Days 12–14: Dashboard + automation + exports
  • Day 15: Final QA + handover

⸻

Budget

• Mid-range / negotiable; we prefer quality and speed over the lowest cost.
• Bonus for early or exceptionally clean delivery.

⸻

To Apply

Please include:

1. Links / examples of similar scraping automation systems you've built.
2. A short note on how you'd structure this pipeline (async, dedupe, proxies).
3. Your availability for the next 15 days and estimated daily hours.

⸻

Tech Stack (Preferred)

• Python 3.11+
• aiohttp / httpx / Scrapy / Playwright
• PostgreSQL / SQLite
• ftfy, unidecode, pandas, regex, python-phonenumbers
• BrightData (Luminati) or equivalent proxy network
• Streamlit / Prometheus / Grafana for monitoring
• openpyxl for export automation
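⸻

Illustrative Sketches (Non-Binding)

To make the expectations concrete, here is a rough sketch of the throttled, retrying async fetch layer we have in mind, using httpx and asyncio. The concurrency limit, retry count, and example URL are placeholders; proxy rotation would be layered on top of the client.

```python
# Minimal sketch: throttled, retrying async fetch with httpx (placeholders, not a spec).
import asyncio
import httpx

MAX_CONCURRENCY = 10   # assumed per-directory politeness limit
RETRIES = 3

async def fetch(client: httpx.AsyncClient, sem: asyncio.Semaphore, url: str) -> str | None:
    async with sem:                                   # throttle concurrent requests
        for attempt in range(1, RETRIES + 1):
            try:
                resp = await client.get(url, timeout=30.0)
                resp.raise_for_status()
                return resp.text
            except httpx.HTTPError:
                await asyncio.sleep(2 ** attempt)     # simple exponential backoff
    return None                                       # caller logs and re-queues the URL

async def crawl(urls: list[str]) -> list[str | None]:
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with httpx.AsyncClient(follow_redirects=True) as client:
        return await asyncio.gather(*(fetch(client, sem, u) for u in urls))

if __name__ == "__main__":
    pages = asyncio.run(crawl(["https://example.com/firmen?page=1"]))  # placeholder URL
```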
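The deduplication / change-detection idea is a stable content hash per company record, compared against the previously stored hash so unchanged records are skipped. A minimal sketch against SQLite follows; the table layout and field names are our assumptions, not requirements.

```python
# Minimal sketch of hash-based change detection (schema and fields are assumptions).
import hashlib
import json
import sqlite3
from datetime import datetime, timezone

SCHEMA = """CREATE TABLE IF NOT EXISTS companies (
    company_id   TEXT PRIMARY KEY,
    content_hash TEXT NOT NULL,
    payload      TEXT NOT NULL,
    updated_at   TEXT NOT NULL
)"""

def record_hash(record: dict) -> str:
    # Canonical key order so identical content always yields the same digest.
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def upsert_if_changed(conn: sqlite3.Connection, company_id: str, record: dict) -> bool:
    """Return True if the record is new or changed (i.e. worth re-processing)."""
    new_hash = record_hash(record)
    row = conn.execute(
        "SELECT content_hash FROM companies WHERE company_id = ?", (company_id,)
    ).fetchone()
    if row and row[0] == new_hash:
        return False                      # unchanged: skip re-scraping / enrichment
    conn.execute(
        "INSERT INTO companies (company_id, content_hash, payload, updated_at) "
        "VALUES (?, ?, ?, ?) "
        "ON CONFLICT(company_id) DO UPDATE SET "
        "content_hash = excluded.content_hash, payload = excluded.payload, "
        "updated_at = excluded.updated_at",
        (company_id, new_hash, json.dumps(record, ensure_ascii=False),
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return True
```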
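For normalization, a sketch of how ftfy, unidecode, and python-phonenumbers could be combined: mojibake repaired, whitespace collapsed, phone numbers validated and formatted for the DE region, and an ASCII-folded key used only for duplicate matching so display fields keep their Umlauts. The exact rules are to be agreed with the freelancer.

```python
# Minimal normalization sketch; the folding/formatting rules are assumptions.
import re
import ftfy
import phonenumbers
from unidecode import unidecode

def clean_company_name(raw: str) -> str:
    text = ftfy.fix_text(raw)                  # repair mojibake, e.g. "MÃ¼ller" -> "Müller"
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

def dedup_key(name: str, city: str) -> str:
    # ASCII-folded, lowercased key used only for duplicate matching;
    # the stored display value keeps its Umlauts.
    return f"{unidecode(name).lower()}|{unidecode(city).lower()}"

def normalize_phone(raw: str, region: str = "DE") -> str | None:
    try:
        parsed = phonenumbers.parse(raw, region)
    except phonenumbers.NumberParseException:
        return None
    if not phonenumbers.is_valid_number(parsed):
        return None
    return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
```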
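The export step could be as simple as dumping the cleaned table to CSV and .xlsx (pandas with openpyxl as the writer) on a daily/weekly schedule. File names and columns below are placeholders.

```python
# Minimal export sketch; paths, columns, and the SQLite source are placeholders.
import sqlite3
import pandas as pd

def export_snapshot(db_path: str = "companies.db", out_stem: str = "companies_export") -> None:
    with sqlite3.connect(db_path) as conn:
        df = pd.read_sql_query(
            "SELECT company_id, payload, updated_at FROM companies ORDER BY updated_at DESC",
            conn,
        )
    df.to_csv(f"{out_stem}.csv", index=False)
    # openpyxl acts as the .xlsx writer engine here.
    df.to_excel(f"{out_stem}.xlsx", index=False, engine="openpyxl")
```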
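Finally, a bare-bones idea of the Streamlit monitoring page: a few headline metrics plus the most recent records. Metric names and the stats source are placeholders; proxy health and log tailing would be added in the real dashboard.

```python
# Minimal Streamlit dashboard sketch; data source and metrics are placeholders.
import sqlite3
import pandas as pd
import streamlit as st

st.title("Scraper Monitor")

with sqlite3.connect("companies.db") as conn:
    total = conn.execute("SELECT COUNT(*) FROM companies").fetchone()[0]
    recent = pd.read_sql_query(
        "SELECT company_id, updated_at FROM companies ORDER BY updated_at DESC LIMIT 50",
        conn,
    )

col1, col2 = st.columns(2)
col1.metric("Unique companies", total)
col2.metric("Records in last batch", len(recent))
st.dataframe(recent)   # latest records as a simple activity log
```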