We have a Python-based processing pipeline on Google Cloud.

1. When a user uploads a large file, our backend triggers a **Cloud Run Job**.
2. This job splits the file into multiple smaller parts, uploads them to GCS, and then sends each part to the **Vertex AI Gemini API** for processing.

### Current Setup

1. A semaphore caps concurrency at **12 concurrent requests per job** (see the sketch at the end of this post).
2. Example:
   a. A 100-page file is split into 100 parts.
   b. At most 12 requests are sent to Gemini concurrently.
   c. As responses return, further requests are sent from the queue.

### The Problem

* Even with a **single user / single Cloud Run job**, we frequently hit **`429 Resource Exhausted` errors**.
* Gemini APIs reportedly use a **dynamic shared quota**, so the available capacity is unpredictable.
* This raises the concern that at scale (many users/jobs) the system could break down.

### What We’re Looking For

* Someone who has **faced and solved similar quota/concurrency issues with the Gemini API or other Vertex AI APIs**.
* We **don’t want suggestions** such as:
  * Switching to different Gemini models
  * Paying for provisioned throughput
* Our stack is fixed on **Gemini 2.5 Pro / flash-latest**.

### Payment

* If your solution works in practice, we’ll pay you.
* Please apply only if you have **hands-on experience fixing this type of issue**.
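
---

For reference, here is a minimal sketch of the setup described under "Current Setup". It assumes an asyncio-based implementation and the `google-genai` SDK pointed at Vertex AI; identifiers such as `process_part`, `process_file`, and the project/location values are placeholders, not taken from our actual code.

```python
import asyncio

from google import genai

MAX_CONCURRENT_REQUESTS = 12  # the per-job cap described above

# vertexai=True routes requests through the Vertex AI Gemini API;
# project and location are placeholder values.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")


async def process_part(semaphore: asyncio.Semaphore, part_text: str) -> str:
    """Send one file part to Gemini, holding a semaphore slot for the call."""
    async with semaphore:
        # At most 12 of these calls are in flight at once; the remaining
        # parts wait here until a slot frees up. The quota errors we see
        # surface from this call as HTTP 429 (RESOURCE_EXHAUSTED).
        response = await client.aio.models.generate_content(
            model="gemini-2.5-pro",  # stack is fixed on 2.5 Pro / flash-latest
            contents=part_text,
        )
        return response.text


async def process_file(parts: list[str]) -> list[str]:
    """Fan out all parts at once; the semaphore throttles real concurrency."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)
    return await asyncio.gather(*(process_part(semaphore, p) for p in parts))


# Usage: results = asyncio.run(process_file(parts)), where `parts` holds the
# ~100 text chunks produced by the splitting step.
```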