Build a monthly web scraper for a European institutional website (projects, good practices, events)

I need a scraper that extracts the textual content from all pages, PDFs, and other documents available on a specific website. These would cover:

• Approved projects / project pages (project description and key fields)
• Good practices (good practice pages + text fields)
• Events (event pages + date/location/description)

Requirements:
• Python preferred (Scrapy + Requests/BS4; Playwright if needed)
• Output as CSV (or database + exports)
• Run automatically once per month
• Track changes: new/updated/deleted pages since last run
• Provide clean code, documentation, and logs
• Respect robots.txt/ToS and implement rate limiting

Deliverables:
• Source code repo
• Deployed scheduled job (GitHub Actions / Cloud Run / AWS Lambda / VPS cron)
• Example output files
• Setup instructions and maintenance notes

In short: I need a well-structured Python scraper that harvests the textual content of three sections of an institutional European site: approved project pages (project description + key fields), good practice pages, and event pages (including date, location, and description). The crawler must detect and label anything new, updated, or removed since its previous run, so that the dataset always reflects the current state of the site.

Stack & execution
– Please build it with Scrapy and plain Requests; fall back on headless browser techniques (e.g. Playwright) only when unavoidable.
– The job will run automatically every month through GitHub Actions, so the repo should contain a workflow file that installs dependencies, executes the crawl, and pushes the fresh export to a dedicated branch or release asset.

Data handling
– One tidy CSV is the only required export, but architect the pipeline so the data objects can just as easily be dumped in other formats later.
– Versioned outputs should indicate the crawl date and summarise the counts of added/changed/deleted records in a log.
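A monthly GitHub Actions workflow along the lines described above could look like the sketch below. All names here are placeholders rather than part of the brief: the workflow filename, the `site_spider` spider name, the `exports/` directory, and the `data` branch are illustrative assumptions.

```yaml
# .github/workflows/monthly-crawl.yml — hypothetical names throughout
name: monthly-crawl
on:
  schedule:
    - cron: "0 3 1 * *"    # 03:00 UTC on the 1st of every month
  workflow_dispatch: {}     # allow manual runs for testing
jobs:
  crawl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: scrapy crawl site_spider -O "exports/crawl-$(date +%F).csv"
      - name: Commit export to data branch
        run: |
          git config user.name "crawl-bot"
          git config user.email "bot@example.invalid"
          git add exports/
          git commit -m "Monthly crawl $(date +%F)" || echo "no changes"
          git push origin HEAD:data
```

The `workflow_dispatch` trigger is worth keeping so the crawl can be exercised on demand without waiting for the monthly cron.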
Operational rules
– Honour robots.txt and any Terms of Service; throttle politely with an adjustable rate limit.
– Emit clear logs, and include HTTP error handling and retry logic.
– Include a README covering setup, environment variables, and how to tweak the schedule or selectors when the site changes.

Deliverables
• Git repository with fully commented source code, Scrapy settings, and the GitHub Actions workflow
• Example CSV produced from the initial crawl
• Change-tracking mechanism (diff JSON or similar) demonstrated in that first run
• README / maintenance notes explaining deployment, updating selectors, and extending output formats
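The robots/ToS and throttling rules above map directly onto standard Scrapy settings. A minimal, adjustable configuration sketch follows; the numeric values are illustrative defaults to tune per site, not requirements from the brief:

```python
# settings.py — illustrative politeness settings for the crawler
ROBOTSTXT_OBEY = True            # honour robots.txt before fetching anything

# AutoThrottle adapts the request delay to observed server latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2.0   # seconds; starting point, tune per site
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

DOWNLOAD_DELAY = 1.0             # floor between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Retry transient HTTP errors a few times, then give up and log
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]

LOG_LEVEL = "INFO"
```

Keeping these in `settings.py` (or overriding them via environment variables) gives the "adjustable rate-limit" asked for without touching spider code.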
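The change-tracking deliverable (new / updated / deleted since the previous run) can be sketched as a hash comparison between two crawl snapshots. The function below is an illustrative minimal version, not a prescribed design; the `{url: extracted_text}` snapshot shape is an assumption for the example.

```python
import hashlib
import json

def diff_snapshots(previous: dict[str, str], current: dict[str, str]) -> dict:
    """Compare two {url: extracted_text} snapshots and classify each URL.

    Returns a dict suitable for dumping as the 'diff JSON' deliverable.
    """
    def digest(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    prev_hashes = {url: digest(text) for url, text in previous.items()}
    curr_hashes = {url: digest(text) for url, text in current.items()}

    added = sorted(curr_hashes.keys() - prev_hashes.keys())
    deleted = sorted(prev_hashes.keys() - curr_hashes.keys())
    updated = sorted(
        url for url in curr_hashes.keys() & prev_hashes.keys()
        if curr_hashes[url] != prev_hashes[url]
    )
    return {
        "added": added, "updated": updated, "deleted": deleted,
        "counts": {"added": len(added), "updated": len(updated),
                   "deleted": len(deleted)},
    }

# Example: one page changed, one removed, one brand new
prev = {"/projects/1": "old text", "/events/2": "same"}
curr = {"/projects/1": "new text", "/practices/3": "fresh"}
print(json.dumps(diff_snapshots(prev, curr), indent=2))
```

Dumping this structure to a dated JSON file each month covers both the diff artifact and the added/changed/deleted counts requested for the log.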