I want to stand up an end-to-end Retrieval-Augmented Generation (RAG) workflow in Python that I can run either on my own machine or in a fresh Google Colab notebook.

The data source is a mixed collection of 10 to 50 French-language PDFs. Some are born-digital, others are scanned images, so text extraction has to switch seamlessly between classic parsers (e.g., PyPDF2/PDFMiner) and OCR with Tesseract or equivalent.

Once the raw text is available, the script should create embeddings with fully open-source models (Sentence-Transformers or a French model from Hugging Face is fine), store them locally in something lightweight like FAISS or Chroma, and expose a simple retrieval+generation loop. Users will ask both straightforward factual questions and higher-level summary prompts, and the system must return concise answers in French with in-line citations pointing back to page-level sources. No paid APIs are allowed at any stage.

Configuration should live in a YAML file so I can tweak paths, model names, chunk size, retrieval k, and Colab vs. local runtime without touching the core code. Clear logging and comments are appreciated so I can adapt the pipeline later.

Deliverables
• Clean, modular Python code (scripts or notebook) implementing extraction, OCR, embedding creation, the vector store, and QA generation
• A sample config.yaml plus example command lines or cells demonstrating local and Colab execution
• README.md with setup instructions, required open-source libraries, and hardware notes
• A brief usage guide showing how to drop a new batch of PDFs into the folder and obtain cited answers

To make the requirements concrete, I've appended rough sketches of each stage below; treat them as illustrations of intent, not as the required implementation.

A detailed technical brief will be shared privately after shortlisting; for now, let me know your experience with similar RAG builds and any open-source French models you prefer.
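First, extraction. Here is a minimal sketch of the per-page switch between a classic parser and OCR, assuming PyPDF2, pdf2image (which needs the poppler binary installed), and pytesseract with the French ("fra") language pack available; the MIN_CHARS threshold is an arbitrary heuristic of mine, not a requirement:

```python
# Sketch: page-level extraction with OCR fallback for scanned pages.
from PyPDF2 import PdfReader
from pdf2image import convert_from_path  # requires the poppler binary
import pytesseract                       # requires Tesseract + the 'fra' language pack

MIN_CHARS = 40  # assumption: below this, treat the page as a scanned image

def extract_pages(pdf_path: str) -> list[dict]:
    """Return one record per page, switching to OCR when a page yields no usable text."""
    reader = PdfReader(pdf_path)
    pages = []
    for i, page in enumerate(reader.pages):
        text = (page.extract_text() or "").strip()
        if len(text) < MIN_CHARS:
            # Born-digital extraction failed; rasterize just this page and OCR it in French.
            image = convert_from_path(pdf_path, dpi=300, first_page=i + 1, last_page=i + 1)[0]
            text = pytesseract.image_to_string(image, lang="fra").strip()
        pages.append({"source": pdf_path, "page": i + 1, "text": text})
    return pages
```

Keeping the page number on every record is what later makes page-level citations possible.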
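Next, indexing and retrieval. A sketch with Sentence-Transformers and FAISS; the multilingual MiniLM model id is just one open-source option that handles French, and I'm assuming chunking has already produced records carrying text plus page metadata:

```python
# Sketch: open-source embeddings in a local FAISS index with cosine retrieval.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: any French-capable open-source model works the same way.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

def build_index(chunks: list[dict]) -> faiss.IndexFlatIP:
    """Embed chunk texts and store them in a FAISS index (inner product on unit vectors = cosine)."""
    vectors = model.encode([c["text"] for c in chunks], normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(np.asarray(vectors, dtype="float32"))
    return index

def retrieve(index, chunks: list[dict], question: str, k: int = 4) -> list[dict]:
    """Return the top-k chunks, each still carrying its page-level source for citation."""
    query = model.encode([question], normalize_embeddings=True)
    _, ids = index.search(np.asarray(query, dtype="float32"), k)
    return [chunks[i] for i in ids[0]]
```

The retrieval k here would of course come from the YAML config rather than a hard-coded default.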
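For generation, any local open-source instruct model served through the transformers library would do; the model id below is a placeholder assumption on my part, and the prompt simply instructs the model to answer concisely in French and cite (file, page) pairs drawn from the retrieved chunks:

```python
# Sketch: citation-aware prompting with a local open-source LLM.
# The model id is an assumption; swap in whichever French-capable model you prefer.
from transformers import pipeline

generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def answer(question: str, hits: list[dict]) -> str:
    """Build a French prompt that asks for an in-line (file, p.N) citation after each claim."""
    context = "\n".join(f"[{h['source']}, p.{h['page']}] {h['text']}" for h in hits)
    prompt = (
        "Réponds en français, de façon concise, en citant tes sources "
        "sous la forme (fichier, p.N) après chaque affirmation.\n\n"
        f"Contexte:\n{context}\n\nQuestion: {question}\nRéponse:"
    )
    out = generator(prompt, max_new_tokens=200, return_full_text=False)
    return out[0]["generated_text"].strip()
```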
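Finally, configuration. This is the kind of config.yaml schema I have in mind, shown inline as a string loaded with PyYAML so the example is self-contained; every key name here is an assumption, and the final schema is the implementer's call:

```python
# Sketch: the sample config.yaml content and how the pipeline might load it.
import yaml

SAMPLE_CONFIG = """
pdf_dir: ./pdfs
runtime: local            # or: colab
embedding_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
chunk_size: 500           # characters per chunk (assumed unit)
chunk_overlap: 50
retrieval_k: 4
vector_store: faiss       # or: chroma
index_path: ./index
"""

config = yaml.safe_load(SAMPLE_CONFIG)
print(config["retrieval_k"])  # -> 4
```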