I’m kicking off a series of small data-collection projects and want to start with a focused, UI-driven scraper for news sites. The goal of this first engagement is simple: pull complete editorial content and store it cleanly so I can query it later without touching the source pages again.

Scope of the first task
• Target: one or two public news sites (I’ll share the URLs once we start).
• Data points:
  – Headlines
  – Full article bodies
  – Any PDFs embedded in or linked from the article pages
• Destination: insert everything into my database (I’m flexible on MySQL, PostgreSQL, or SQLite as long as the schema is documented; a rough schema sketch is appended at the end of this brief).

Technical notes
I’m leaning toward a browser-automation approach (Selenium, Playwright, or Puppeteer) because some targets rely on dynamic rendering. If a lighter stack such as Scrapy plus BeautifulSoup can achieve the same reliability, I’m open to that, but whatever toolchain you choose must handle JavaScript-driven content and pagination (a minimal rendering sketch also follows at the end of this brief).

Deliverables for this pilot
1. A runnable script or small GUI that launches the scrape, lets me choose the target site, and shows live progress.
2. Well-structured, commented code that I can extend to additional outlets.
3. A populated database file (or a connection string plus schema) proving the extraction works end to end.
4. A brief readme covering setup, dependencies, and how to point the scraper at a new domain.

If everything runs smoothly, I have more sites and features queued up, so consider this the first milestone of a longer collaboration. Let me know which stack you prefer, how you’d tackle dynamic content, and your estimated turnaround for the pilot.
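
Rough sketches (for illustration only)
To make the “documented schema” requirement concrete, here is a sketch of the kind of layout I have in mind, written against SQLite so there is nothing extra to install. The table and column names are only placeholders, not a final design; whichever engine you choose, the deliverable should document the equivalent structure.

```python
# A minimal schema sketch, assuming SQLite; table and column names are placeholders.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    source_site TEXT NOT NULL,               -- domain the article came from
    url         TEXT NOT NULL UNIQUE,        -- canonical article URL
    headline    TEXT NOT NULL,
    body        TEXT,                        -- full article text
    scraped_at  TEXT DEFAULT (datetime('now'))
);

CREATE TABLE IF NOT EXISTS article_pdfs (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    article_id INTEGER NOT NULL REFERENCES articles(id),
    pdf_url    TEXT NOT NULL,
    local_path TEXT                          -- where the downloaded file ends up
);
"""

def init_db(path: str = "news.db") -> sqlite3.Connection:
    """Create the database file and both tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```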
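
Similarly, this is roughly what I mean by handling JavaScript-driven content: a minimal Playwright sketch that fully renders a page before extracting anything. The CSS selectors and the wait strategy are assumptions that would need tuning per target once I share the real URLs, and pagination handling would sit on top of this.

```python
# A minimal rendering sketch, assuming Playwright; selectors are placeholders per site.
from playwright.sync_api import sync_playwright

def scrape_article(url: str) -> dict:
    """Render a JavaScript-driven article page and pull its headline, body, and PDF links."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # let dynamic content settle

        headline = page.locator("h1").first.inner_text()
        paragraphs = page.locator("article p").all_inner_texts()
        pdf_links = [a.get_attribute("href") for a in page.locator("a[href$='.pdf']").all()]

        browser.close()

    return {
        "url": url,
        "headline": headline,
        "body": "\n\n".join(paragraphs),
        "pdf_urls": [link for link in pdf_links if link],
    }
```

Something along these lines, wired into the schema above and wrapped in the progress UI, is the shape of the pilot I’m picturing.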
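
And for the PDF requirement, a small sketch of how linked files could be downloaded and tied back to their article rows; the downloads directory and the filename scheme are just assumptions.

```python
# A download sketch, assuming the requests library; directory and naming are placeholders.
import os
import requests

def download_pdf(pdf_url: str, dest_dir: str = "downloads") -> str:
    """Fetch a linked PDF and return the local path it was saved to."""
    os.makedirs(dest_dir, exist_ok=True)
    filename = pdf_url.rstrip("/").split("/")[-1] or "document.pdf"
    local_path = os.path.join(dest_dir, filename)
    response = requests.get(pdf_url, timeout=30)
    response.raise_for_status()  # fail loudly on broken links
    with open(local_path, "wb") as fh:
        fh.write(response.content)
    return local_path
```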