Hybrid Scraper for DEF 14A Statements

Client: AI | Published: 13.03.2026

I have roughly 5,000 DEF 14A proxy statements in HTML format, and I need the key compensation details for each named executive pulled out and placed into a clean, structured file. The fields I must end up with are: base salary, stock options and awards, bonuses / incentive pay, plus any other compensation figures that appear in the summary or grants tables. Because the data are scattered across both narrative text blocks and embedded HTML tables, a purely scripted scrape misses too much, while a purely manual effort would be too slow. I'm therefore looking for a balanced workflow that blends solid Python-based parsing (BeautifulSoup, pandas, regex, maybe an LLM call for tricky passages) with targeted human review to catch formatting quirks and footnotes.

Deliverables
• A single CSV or Excel file where each row is a firm-year filing and each column holds one of the compensation items above, clearly labeled.
• A short read-me describing the extraction logic, any LLM prompts used, and the quality-control steps you applied.
• A reproducible script or notebook so I can rerun the pipeline on future filings.

Acceptance criteria
• ≥ 95% of filings processed; missing cases flagged with reasons.
• A random audit of 50 filings must show a ≤ 5% field-level error rate.
• Output passes numeric sanity checks (e.g., no negative salaries, totals match table footings when provided).

If you have experience parsing SEC filings or have already built hybrid scraping/LLM solutions, that will help you move quickly. Let me know how you plan to split automation versus manual review, which tools or models you prefer, and your estimated turnaround time for the full 5,000-file set.
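To give a sense of the automated half of the workflow, here is a rough BeautifulSoup sketch for locating the Summary Compensation Table in a filing. The caption heuristic and the assumption of a single header row are placeholders of mine; real filings vary widely (multi-row headers, nested tables, footnote rows), which is exactly where the human-review pass comes in.

```python
from bs4 import BeautifulSoup

SCT_HINT = "summary compensation table"

def find_summary_comp_rows(html: str) -> list[dict]:
    """Locate the Summary Compensation Table in a DEF 14A filing and
    return one dict per data row, keyed by the header cells.
    Heuristic sketch only: production code needs fuzzier caption
    matching and handling for multi-row headers and footnote rows."""
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):
        # The caption usually sits in text just before the table,
        # sometimes inside it; check both.
        inside = table.get_text(" ", strip=True).lower()
        before = " ".join(
            s.lower() for s in table.find_all_previous(string=True, limit=10)
        )
        if SCT_HINT not in inside and SCT_HINT not in before:
            continue
        rows = table.find_all("tr")
        if not rows:
            continue
        header = [c.get_text(" ", strip=True) for c in rows[0].find_all(["th", "td"])]
        out = []
        for tr in rows[1:]:
            cells = [c.get_text(" ", strip=True) for c in tr.find_all(["th", "td"])]
            if len(cells) == len(header):  # skip footnote/spacer rows
                out.append(dict(zip(header, cells)))
        return out
    return []  # no matching table: flag the filing for manual review
```

Filings that return an empty list would be routed to the LLM/manual queue and counted among the flagged missing cases.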
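The numeric sanity checks in the acceptance criteria could be as simple as a regex-based money parser plus a footing comparison. The column names below ("Salary", "Total", etc.) are illustrative, not a fixed schema:

```python
import re

# Matches '$1,234,567', '(12,000)', '1,234.56'; parentheses mean negative.
MONEY_RE = re.compile(r"\(?\$?\s*([\d,]+(?:\.\d+)?)\)?")

def parse_money(cell: str):
    """Turn a table cell into a float; dashes and footnote-only cells
    (no digits) become None."""
    cell = cell.strip()
    m = MONEY_RE.search(cell)
    if not m:
        return None
    value = float(m.group(1).replace(",", ""))
    return -value if cell.startswith("(") else value

def sanity_check(row: dict, tol: float = 1.0) -> list:
    """Flag rows violating the acceptance criteria: a negative salary,
    or a reported Total that disagrees with the sum of the components."""
    problems = []
    salary = parse_money(row.get("Salary", ""))
    if salary is not None and salary < 0:
        problems.append("negative salary")
    total = parse_money(row.get("Total", ""))
    if total is not None:
        parts = [parse_money(row.get(c, "")) for c in
                 ("Salary", "Bonus", "Stock Awards", "Option Awards",
                  "All Other Compensation")]
        footed = sum(p for p in parts if p is not None)
        if abs(footed - total) > tol:
            problems.append(f"total {total:,.0f} != components {footed:,.0f}")
    return problems
```

Any row returning a non-empty problem list would go into the flagged-cases report rather than the final CSV.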