Mass PDF Database Extraction

I have to pull many thousands of PDF files from a publicly available but poorly structured online database. The pages are slow, there are no clear download links, and navigation relies on clunky JavaScript forms, so a straightforward “save as” approach will take far too long.

You will receive a text file that contains the exact filenames for every document I need. Those filenames appear in the HTML once the record is loaded, so they can be used as reliable anchors for the scrape. The order in which the files arrive does not matter; accuracy and completeness do.

I expect an automated approach (Python with Selenium, Playwright, Scrapy, or any comparable tool is fine), as long as it can work around the site’s fragile structure and occasional timeouts. If headless browsing or rate-limiting tricks are required, please build them in.

Deliverables:
• A zipped archive (or split archives) containing every requested PDF.
• The runnable script with clear inline comments so I can repeat the process in future. I hope to run this program every few weeks to capture up-to-date files.
• A brief README explaining environment setup, command-line usage, and any third-party libraries.

I will validate the job by spot-checking a random sample of filenames against the list I provide and by ensuring the script reproduces the full download set on my end without manual tweaks.

The above is AI generated for this job; the following is my own description.

I want to create a readable store/database of AFCA decisions. Their website is afca.org.au. My plan is to create a ChatGPT (or similar) AI tool to summarise each determination, or to search across all determinations for keywords or phrases. AFCA publishes each determination as a PDF document. I only need the text of each determination, so whether your tool captures each PDF or simply saves the text as a separate .txt file is up to you. As far as storage size goes, .txt files will be far smaller.
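The brief asks for tolerance of occasional timeouts and for completeness against the supplied filename list. A minimal sketch of both ideas, using only the Python standard library, might look like the following. The names retry and missing_files, the delay values, and the assumption that a failed fetch raises the built-in TimeoutError or ConnectionError are illustrative only; a Selenium or Playwright script would need to catch that tool's own timeout exception instead.

```python
import time
from pathlib import Path


def retry(fetch, attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure.

    fetch is any zero-argument callable that downloads one document
    (hypothetical; the real one would drive the browser or HTTP client).
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller log the failure
            # Exponential backoff: 2s, 4s, 8s, ... with the default base_delay
            sleep(base_delay * 2 ** attempt)


def missing_files(wanted, download_dir):
    """Return the wanted filenames that are not yet present in download_dir."""
    have = {p.name for p in Path(download_dir).iterdir() if p.is_file()}
    return [name for name in wanted if name not in have]
```

Running missing_files after each pass and re-queueing whatever it returns is one way to make a run repeatable every few weeks without downloading files that are already on disk.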
I don't need an Access or similar database created; I seek only the documents themselves for use in an AI environment.

As far as indexing goes, we can start with this:
• Date
• Determination/Case number
• Financial Firm

Creating an index in Excel or similar seems easiest. Those details appear on the first page of each determination.

At the outset, there will be many thousands of determinations across the old and new databases. Their online search facility is very poor.

Older determinations (2018-2024)
https://www.afca.org.au/what-to-expect/search-published-decisions
Take note of this service advisory: "We are aware that some PDF links show the message *error opening/reading pdf file*. If you see this message, please disregard it. Simply click the link and the PDF will open as normal."

Newer determinations (since 2024)
https://my.afca.org.au/searchpublisheddecisions/?_gl=1*gbf20z*_gcl_au*MjEwNjExNjQxNi4xNzY2MTg0OTg5

I would be happy to start with the newer determinations only, to check validity, and then look at the older database.
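The index described above (date, determination/case number, financial firm, taken from the first page) could be built as a CSV file, which opens directly in Excel. The sketch below is a starting point under stated assumptions: the label wording in FIELD_PATTERNS is a guess and must be checked against real determinations, the function names are hypothetical, and the first-page text is assumed to have been extracted already by a PDF library.

```python
import csv
import re

# Hypothetical label patterns: the real first-page wording must be checked
# against actual determinations and adjusted before use.
FIELD_PATTERNS = {
    "date": re.compile(r"\bDate\s*:\s*(.+)", re.IGNORECASE),
    "case_number": re.compile(r"\bCase number\s*:\s*(.+)", re.IGNORECASE),
    "financial_firm": re.compile(r"\bFinancial firm\s*:\s*(.+)", re.IGNORECASE),
}


def index_row(filename, first_page_text):
    """Build one index row from the text of a determination's first page."""
    row = {"file": filename}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(first_page_text)
        # Leave the cell blank when a label is not found, so gaps are visible
        row[field] = match.group(1).strip() if match else ""
    return row


def write_index(rows, csv_path):
    """Write the index rows to a CSV file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file", *FIELD_PATTERNS])
        writer.writeheader()
        writer.writerows(rows)
```

Blank cells in the resulting file flag determinations whose first page did not match the assumed labels, which gives a quick way to spot-check the patterns on the newer database before moving on to the older one.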

Python
