I have several hundred publicly available web pages, each containing one or more structured HTML tables. I need every row and column from those tables cleaned, normalised and loaded into a database that supports advanced, multi-criteria search (combinations of keywords, date ranges, numeric filters, and so on). Right now all I have is the raw web pages (no CSVs, no APIs), so part of the job is building a reliable scraping/ETL pipeline. I am open to whichever back end you feel is most appropriate (MySQL, PostgreSQL, MongoDB or another you can justify) as long as the final result performs well and is easy for me to maintain.

Deliverables
• Scraper/ETL script(s) with clear setup instructions (Python + BeautifulSoup, Node.js, or your preferred stack); a rough sketch of the kind of extraction I have in mind appears below
• Normalised database schema and a populated database
• Search layer (REST endpoint or lightweight web UI) that lets me combine multiple criteria in a single query; an example of the query shape is also sketched below
• Brief hand-off documentation so I can rerun the import or extend the schema later

Acceptance criteria
• 100% of table data present and accurate when spot-checked against the source pages
• Queries using at least three simultaneous filters return correct results in under two seconds on a mid-range VPS
• All code supplied in a private Git repo with a clear README

If this sounds like a challenge you enjoy, let’s talk about the details and timelines.
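
For context, here is a minimal sketch of the sort of extraction step I have in mind, assuming Python 3 with the requests and beautifulsoup4 packages installed; the URL is a placeholder, and the real pipeline would of course also need cleaning, type coercion and database-loading stages.

```python
# Minimal illustrative sketch only; the URL below is a placeholder, not a real source page.
import requests
from bs4 import BeautifulSoup


def extract_tables(url):
    """Fetch a page and yield every row of every HTML table as a list of cell strings."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    for table in soup.find_all("table"):
        for row in table.find_all("tr"):
            # Collect both header and data cells, stripped of surrounding whitespace.
            cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
            if cells:
                yield cells


if __name__ == "__main__":
    # Replace with one of the actual source pages when running the real import.
    for row in extract_tables("https://example.com/page-with-tables.html"):
        print(row)
```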
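And here is a hedged illustration of the query shape the search layer should support, assuming a PostgreSQL back end accessed via psycopg2 and a hypothetical records table with description, event_date and amount columns; the actual schema and column names will come out of your normalisation work.

```python
# Illustrative only: table and column names are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect("dbname=scraped_tables user=app")
with conn.cursor() as cur:
    # Three simultaneous filters: keyword, date range, numeric threshold.
    cur.execute(
        """
        SELECT *
        FROM records
        WHERE description ILIKE %s          -- keyword filter
          AND event_date BETWEEN %s AND %s  -- date range filter
          AND amount >= %s                  -- numeric filter
        """,
        ("%invoice%", "2023-01-01", "2023-12-31", 100),
    )
    for row in cur.fetchall():
        print(row)
conn.close()
```

The two-second acceptance target assumes queries like this are backed by sensible indexes, which I would expect to be part of the schema design rather than something bolted on afterwards.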