Website Text Scraper & Loader

I need a small service that can pull text from a specific public website, shape that content into a clean JSON structure, and push it straight into my database every five minutes.

Core flow
• Fetch only the on-page text I specify (headings, body copy, metadata).
• Convert the extracted text into a well-formed JSON payload that matches my table schema.
• Insert or upsert the record set so I never create duplicates.

Technical notes
The site uses no login and loads its content server-side, so a lightweight stack such as Python + Requests/BeautifulSoup, Node + Cheerio, or comparable tools should be enough. If a headless browser becomes necessary, I’m open to Puppeteer or Selenium. The important thing is reliability and speed because of the five-minute interval.

What I’ll test before sign-off
1. A single run populates the correct tables with correctly mapped fields.
2. The scheduler triggers automatically every five minutes and logs each run.
3. Error handling skips or retries bad pages without halting the loop.
4. Setup instructions let me redeploy on a fresh server in under 10 minutes.

Down the road I may extend the same pipeline to images or videos, so designing the code to allow new parsers would be appreciated, but for now the deliverable is strictly text → JSON → DB.
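
To make the text → JSON → DB flow concrete, here is a minimal sketch of one possible shape for the service. Everything named in it is an assumption for illustration: the `pages` table and its columns, the JSON field names, and the helper functions `parse_page`, `upsert`, `run_once` and `run_forever` are hypothetical, since the actual schema and target site are not specified here. To keep the sketch runnable without installs, it uses the standard library's `html.parser` and `urllib` in place of Requests/BeautifulSoup, and SQLite as a stand-in for the real database.

```python
import json
import logging
import sqlite3
import time
from html.parser import HTMLParser
from urllib.request import urlopen

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scraper")


class TextExtractor(HTMLParser):
    """Collects the title, meta description, headings and paragraph text."""

    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        # Hypothetical JSON shape; adjust to the real table schema.
        self.record = {"title": "", "description": "", "headings": [], "body": []}
        self._tag = None  # tag whose text is currently being captured
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "description":
            self.record["description"] = attrs.get("content") or ""
        elif tag in self.HEADINGS or tag in ("title", "p"):
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        if tag == "title":
            self.record["title"] = text
        elif text:
            key = "headings" if tag in self.HEADINGS else "body"
            self.record[key].append(text)
        self._tag = None


def parse_page(url, html):
    """Turn raw HTML into a JSON-serialisable dict."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"url": url, **parser.record}


def upsert(conn, record):
    """Insert the record, or update it if the URL is already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, title TEXT, payload TEXT, fetched_at REAL)"
    )
    conn.execute(
        "INSERT INTO pages (url, title, payload, fetched_at) VALUES (?, ?, ?, ?) "
        "ON CONFLICT(url) DO UPDATE SET title = excluded.title, "
        "payload = excluded.payload, fetched_at = excluded.fetched_at",
        (record["url"], record["title"], json.dumps(record), time.time()),
    )
    conn.commit()


def fetch(url, timeout=10):
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def run_once(conn, urls, fetcher=fetch, retries=2):
    """Process every URL; a failing page is retried, then skipped."""
    stored = 0
    for url in urls:
        for attempt in range(1, retries + 2):
            try:
                upsert(conn, parse_page(url, fetcher(url)))
                stored += 1
                break
            except Exception as exc:
                log.warning("attempt %d failed for %s: %s", attempt, url, exc)
    log.info("run finished: %d/%d pages stored", stored, len(urls))
    return stored


def run_forever(conn, urls, interval=300):
    """Trigger a run every `interval` seconds (300 s = five minutes)."""
    while True:
        started = time.monotonic()
        run_once(conn, urls)
        time.sleep(max(0.0, interval - (time.monotonic() - started)))
```

Calling `run_forever(sqlite3.connect("scraper.db"), [url])` with the target URL would start the five-minute loop; a cron job or systemd timer invoking `run_once` is an equally reasonable alternative. The `ON CONFLICT ... DO UPDATE` clause needs SQLite 3.24 or later and has the same form in PostgreSQL. Because parsing lives in `parse_page` and is separate from fetching and storage, an image or video parser could later be added alongside it without touching the loop.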

Python
