Automate PDF Phrase Extraction

I have a set of PDF files from which I need to pull out only certain phrases: no tables, headings, or other content. There are a few distinct phrase types and I want each type to land in its own column in a single Excel worksheet.

Speed matters, so I’m leaning toward an AI-assisted Python solution that can rip through multiple PDFs in one go, spot the target phrases with reliable pattern matching or NLP, and then push clean, column-separated data straight into .xlsx. You’re free to choose whichever libraries you prefer (pdfplumber, PyPDF2, Camelot, spaCy, even a lightweight transformer model), so long as the final workflow is reproducible on my end with minimal setup.

Deliverables:
• Well-commented script (Python preferred) that takes a folder of PDFs as input
• Output Excel file with each phrase type in its own column
• Brief read-me explaining how to run the code and adjust phrase patterns if needed

I’ll test by running the script on a fresh batch of PDFs; if every required phrase appears in the correct column with no extra text, the task is complete.
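To make the expected workflow concrete, here is a minimal sketch of the kind of script I have in mind. It assumes pdfplumber for text extraction and openpyxl for the Excel output; the phrase types and regex patterns in PHRASE_PATTERNS are placeholders, not my real ones, and the function names and the default output file phrases.xlsx are made up for illustration.

```python
import re
import sys
from itertools import zip_longest
from pathlib import Path

# Hypothetical phrase types and patterns; swap in the real ones.
PHRASE_PATTERNS = {
    "invoice_number": re.compile(r"\bINV-\d{4,}\b"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def extract_phrases(text, patterns=PHRASE_PATTERNS):
    """Return {phrase_type: [matches in order of appearance]} for one text."""
    return {
        name: [m.group(0) for m in rx.finditer(text)]
        for name, rx in patterns.items()
    }


def pdf_text(path):
    """Concatenate the text of every page of one PDF."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        # extract_text() can return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def to_rows(found):
    """Turn {type: [phrases]} into rows; shorter columns are padded with None."""
    return [list(row) for row in zip_longest(*found.values())]


def run(pdf_dir, out_path="phrases.xlsx"):
    """Scan every PDF in pdf_dir and write one column per phrase type."""
    from openpyxl import Workbook  # third-party: pip install openpyxl

    found = {name: [] for name in PHRASE_PATTERNS}
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        for name, hits in extract_phrases(pdf_text(pdf_path)).items():
            found[name].extend(hits)

    wb = Workbook()
    ws = wb.active
    ws.append(list(found))  # header row: one column per phrase type
    for row in to_rows(found):
        ws.append(row)
    wb.save(out_path)


if __name__ == "__main__" and len(sys.argv) > 1:
    run(sys.argv[1])
```

Run it as `python script.py path/to/pdf_folder` (the script name is a placeholder). Because each column is filled independently, phrases on the same row are not necessarily from the same PDF; if that pairing matters, a per-file row layout would be a better fit.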

Python
