AI-Driven Secrets Scraper

Customer: AI | Published: 16.11.2025
Budget: $30

I need a script (Python is fine, but I'm open to another language) that taps the OpenAI API to semantically comb through both GitHub and GitLab repositories, flagging any files that expose hard-coded secrets, hostnames, or IP addresses.

Here is the essence of the assignment:

• Query GitHub and GitLab (public code only) in bulk, using OpenAI's embeddings or GPT search to surface potential matches that traditional regex might miss.
• Confirm each hit with lightweight pattern checks so we minimise false positives.
• Capture the repository URL, file path, line number, the matched string (masked if needed), and a short AI-generated "why this looks sensitive" explanation.
• Output everything to a single, well-structured CSV.

When the run is finished I expect:

1. The annotated CSV.
2. All reusable source code with brief setup notes (API keys, rate-limit handling, etc.).
3. A quick README describing how to extend the search terms or add new host platforms later.

I'll test by running the script on my own token and confirming that at least a sample of results match the sensitive data types above. If you have questions about rate limits, deduplication, or improving precision, let's iron those out early so the first deliverable is already useful.
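
To make the "confirm, mask, and write to CSV" part of the assignment concrete, here is a minimal Python sketch of that stage only. It assumes the semantic search has already surfaced a candidate line and an explanation string. The pattern set, the `Finding` fields, and the helper names (`confirm_line`, `mask`, `write_csv`) are illustrative assumptions, not an agreed specification, and the regexes are deliberately small (the IPv4 check, for example, will also accept dotted version strings such as 1.2.3.4).

```python
import csv
import io
import ipaddress
import re
from dataclasses import asdict, dataclass, fields

# Illustrative patterns only (assumption); a real run would need a broader, tuned set.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_assignment": re.compile(
        r"(?i)(?:password|passwd|secret|token|api[_-]?key)\s*[:=]\s*['\"]([^'\"]{8,})['\"]"
    ),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


@dataclass
class Finding:
    """One CSV row; the column names are hypothetical."""
    repo_url: str
    file_path: str
    line_number: int
    kind: str
    masked_match: str
    explanation: str


def mask(value: str, keep: int = 4) -> str:
    """Keep the first `keep` characters and replace the rest with '*'."""
    if len(value) <= keep:
        return "*" * len(value)
    return value[:keep] + "*" * (len(value) - keep)


def is_valid_ipv4(text: str) -> bool:
    """Reject matches such as 999.1.1.1 that only look like addresses."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        return False


def confirm_line(repo_url, file_path, line_number, line, explanation=""):
    """Run the lightweight pattern checks on one candidate line."""
    findings = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(line):
            # Use the captured value when the pattern defines a group.
            value = match.group(1) if match.groups() else match.group(0)
            if kind == "ipv4" and not is_valid_ipv4(value):
                continue
            findings.append(
                Finding(repo_url, file_path, line_number, kind, mask(value), explanation)
            )
    return findings


def write_csv(findings, out):
    """Write findings to a text stream with a header row."""
    writer = csv.DictWriter(out, fieldnames=[fld.name for fld in fields(Finding)])
    writer.writeheader()
    for finding in findings:
        writer.writerow(asdict(finding))


if __name__ == "__main__":
    rows = confirm_line(
        "https://example.com/org/repo",  # placeholder URL
        "app/config.py",
        12,
        'API_KEY = "sk_test_1234567890"',
        "Quoted value assigned to a key-like name",  # stand-in for the AI explanation
    )
    buffer = io.StringIO()
    write_csv(rows, buffer)
    print(buffer.getvalue())
```

Keeping this stage free of network calls means it can be unit-tested without tokens or rate limits. Extending the search terms would then amount to adding entries to `PATTERNS`.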