Production Recommendation Engine Hardening

Our recommendation engine has been live for a while and serves millions of daily requests, yet it is beginning to show growing pains: intermittent errors, latency spikes, and a few edge-case failures that compromise trust in the suggestions users see. I need an engineer who can jump straight into a Python-based micro-service environment (FastAPI, Redis, PostgreSQL, TensorFlow/PyTorch models wrapped behind gRPC) and stabilise the entire pipeline.

Scope of work
• Trace and resolve the root causes of the current crashes and data-leak edge cases.
• Refactor brittle sections so that unit, integration, and load tests pass consistently.
• Re-tune ranking logic to restore precision and recall that regressed after the last model update.
• Introduce circuit-breaking, graceful fall-backs, and structured logging that can be monitored through Prometheus + Grafana.
• Containerise updates for zero-downtime rolling deployment via Kubernetes (Helm charts already in place).

Deliverables
1. A clean merge-request with all patches, new tests, and updated documentation.
2. A reproducible load-test report demonstrating <150 ms P95 latency at 5 k RPS and no error burst >0.1 %.
3. A short operational run-book covering health checks, feature flag toggles, and rollback steps.

Acceptance criteria
• All CI/CD stages turn green.
• Production traffic shadow run for 48 h shows equal or better CTR compared with the current baseline.
• On-call dashboard remains in the “green” SLO band for the same period.

I will provide direct access to the private Git repo, synthetic datasets, anonymised traffic samples, and current Prometheus metrics. You’ll have a dedicated Slack channel with our data and DevOps engineers for any clarifications while you harden the system end-to-end.
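
To make the circuit-breaking and graceful fall-back item concrete, here is a minimal sketch of the kind of behaviour I have in mind. It is not taken from our codebase: the `CircuitBreaker` class, its `max_failures` and `reset_after` parameters, and the idea of falling back to a precomputed list are all illustrative assumptions, and a production version would also need thread safety and metrics.

```python
import time
from typing import Callable, List, Optional


class CircuitBreaker:
    """Hypothetical breaker: opens after `max_failures` consecutive errors
    and allows a trial call again once `reset_after` seconds have passed."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # Once the cool-down has elapsed, report closed so one trial call
        # can go through (the "half-open" state).
        return self._clock() - self._opened_at < self.reset_after

    def call(
        self,
        primary: Callable[[], List[str]],
        fallback: Callable[[], List[str]],
    ) -> List[str]:
        if self.is_open:
            # Short-circuit: do not touch the failing dependency at all.
            return fallback()
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In this sketch `primary` would wrap the gRPC model call and `fallback` would return something cheap and safe, for example a cached list of popular items, so users still see suggestions while the model service recovers.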

Python
