Python Dev Wanted: Offline PDF/ZIM Harvester → SQLite FTS5 (Quick, Straightforward Build)

Codename: Atlas Harvester (private client project under NDA)

Goal: Build a small Python tool that ingests survival-related PDF/ZIM files from URLs, extracts text, categorizes content with a local LLM, and stores everything in a SQLite (FTS5) database for fast offline search. If you've built PDF parsers + SQLite FTS before, this is a quick win.

What you'll build (Phase 1)

- Input: one or more PDF/ZIM URLs
- Download: save originals to /downloads/ and keep a snapshot (PDF copy) in /snapshots/
- Parse: extract clean text (plus headings and page numbers) using unstructured or PyMuPDF
- Chunk: ~1,000 tokens with small overlap (sketch at the end of this post)
- Categorize & summarize: call a local LLM (e.g., Ollama) to assign 1–3 categories plus a short summary/key points per chunk (sketch below)
- Store: insert into a SQLite FTS5 table (chunks) with metadata: title, url, categories, page, snapshot_path, date_added (schema sketch below)
- Idempotent: re-runs must not duplicate rows; use checksums/versioning (sketch below)
- Runs offline after initial setup (Windows 11 target; macOS a bonus)

Categories (Primary + Extended)

- Primary (12): Medical, Water, Food, Shelter, Fire, Navigation, Communication, Power, Security, Tools & Repair, Logistics & Planning, Community & Psychology
- Extended (5): Weather & Climate, Agriculture & Gardening, Wildlife & Ecology, Technology & Engineering, Education & Training
- Fallback: General Survival
- Multi-label tagging allowed: 1–3 categories per chunk.

Nice to have (stub only in Phase 1)

Video hooks reserved in the schema, no build now: source_type='video', video_url, timestamp_start/end, thumbnail_path. We'll add yt-dlp + Whisper later as a separate milestone.

Tech you likely already use

- Python 3, requests, unstructured or PyMuPDF, sqlite3 / FTS5, tqdm
- Ollama (or an equivalent local LLM runner) for categorization/summaries

Deliverables

- atlas_harvest.py (main script)
- config.yaml (model name, chunk size/overlap, paths, categories list)
- requirements.txt
- README.md (setup/run guide)
- Sample output: atlas.db (SQLite FTS5), /snapshots/, /downloads/, results.csv
- Clean logs; duplicate skipping; clear error messages

Acceptance tests (I will run these)

1. Ingest 2–3 public PDFs → searchable chunks appear with categories, page numbers, and a working snapshot_path
2. FTS5 search returns relevant chunks for "purify water" and "treat hypothermia" (query sketch at the end of this post)
3. Re-running the same URLs produces no duplicate inserts (checksums/versioning work)
4. A config change (e.g., chunk_size) affects new ingests
5. ZIM handling: a read-only pointer or a minimal extraction path, documented
6. Runs offline after the initial model pull

Milestones & payment

Payment is released only after the acceptance tests pass. All IP belongs to the client.

Timeline & budget

Target: 7–10 days total.

Quick screening (answer briefly)

1. Have you built a Python PDF → SQLite (FTS/FTS5) pipeline before? (Yes/No)
2. Are you comfortable calling a local LLM (Ollama) from Python? (Yes/No)
3. Preferred PDF parser and why: unstructured vs. PyMuPDF vs. tika
4. How will you handle idempotency (checksums/versioning)?
5. A one-paragraph plan for storing page numbers and snapshot paths so a UI can jump to the right section

How to apply

- 2–3 lines on relevant experience
- A link to a small, similar Python repo/snippet (or a brief portfolio)
- Your estimate and availability

If this reads like something you've already built, it should go fast. I'm looking to hire today.
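Illustrative sketches (non-binding, for reference)

For the idempotency requirement, here is a minimal sketch of the checksum approach. The `documents` bookkeeping table and its column names are illustrative assumptions, not a required schema:

```python
import hashlib
import sqlite3


def file_checksum(path: str) -> str:
    """SHA-256 of the downloaded file, used as a stable identity across re-runs."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def already_ingested(conn: sqlite3.Connection, checksum: str) -> bool:
    # Assumed bookkeeping table: documents(checksum TEXT PRIMARY KEY, url TEXT, date_added TEXT).
    # Skip the whole parse/categorize/insert path if the checksum is already recorded.
    row = conn.execute(
        "SELECT 1 FROM documents WHERE checksum = ?", (checksum,)
    ).fetchone()
    return row is not None
```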
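A sketch of the ~1,000-token chunking with small overlap. It approximates tokens as whitespace-separated words, which is an assumption; a real tokenizer (e.g., the local model's own) may be preferable if exact token counts matter:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100):
    """Yield overlapping chunks of roughly `chunk_size` tokens.

    Tokens are approximated as whitespace-separated words; swap in a real
    tokenizer if precise counts against the LLM context window are needed.
    """
    assert 0 <= overlap < chunk_size, "overlap must be smaller than chunk_size"
    words = text.split()
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            yield " ".join(chunk)
        if start + chunk_size >= len(words):
            break
```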
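A sketch of the categorization/summary call against a local Ollama server, using Ollama's /api/generate endpoint. The model name, prompt wording, and JSON response shape are assumptions to be driven by config.yaml; the extended categories and "General Survival" fallback are omitted from the list for brevity:

```python
import json
import requests

# Primary 12 only; the 5 extended categories and the "General Survival"
# fallback from the post would be appended in the real config-driven list.
CATEGORIES = [
    "Medical", "Water", "Food", "Shelter", "Fire", "Navigation",
    "Communication", "Power", "Security", "Tools & Repair",
    "Logistics & Planning", "Community & Psychology",
]


def categorize_chunk(chunk: str, model: str = "llama3") -> dict:
    """Ask the local LLM for 1-3 categories plus a short summary, as JSON."""
    prompt = (
        "Classify the following survival text into 1-3 of these categories: "
        + ", ".join(CATEGORIES)
        + '. Reply as JSON: {"categories": [...], "summary": "..."}.\n\n'
        + chunk
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False, "format": "json"},
        timeout=120,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["response"])
```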
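A sketch of the `chunks` FTS5 table, including the reserved video-hook columns. UNINDEXED columns are stored as metadata but excluded from full-text matching. Column names follow the post; the column order and which columns are indexed are assumptions:

```python
import sqlite3

# text, title, and categories are full-text indexed; the UNINDEXED columns
# are stored metadata only. The trailing five columns are the video hooks
# the post reserves for a later yt-dlp + Whisper milestone.
SCHEMA = """
CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING fts5(
    text,
    title,
    categories,
    url UNINDEXED,
    page UNINDEXED,
    snapshot_path UNINDEXED,
    date_added UNINDEXED,
    source_type UNINDEXED,
    video_url UNINDEXED,
    timestamp_start UNINDEXED,
    timestamp_end UNINDEXED,
    thumbnail_path UNINDEXED
);
"""

conn = sqlite3.connect("atlas.db")
conn.executescript(SCHEMA)
conn.close()
```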
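Finally, a sketch of the acceptance-test search (e.g., "purify water"), using FTS5's built-in rank ordering and snippet() for highlighted excerpts. The result limit and snippet formatting are assumptions:

```python
import sqlite3


def search(db_path: str, query: str, limit: int = 10):
    """Return best-matching chunks with title, page, snapshot path, and excerpt."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT title, page, snapshot_path,
               snippet(chunks, 0, '[', ']', ' ... ', 12) AS excerpt
        FROM chunks
        WHERE chunks MATCH ?
        ORDER BY rank
        LIMIT ?
        """,
        (query, limit),
    ).fetchall()
    conn.close()
    return rows


for title, page, snap, excerpt in search("atlas.db", "purify water"):
    print(f"{title} (p.{page}) -> {snap}\n  {excerpt}")
```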