I’m looking for a senior AI/ML engineer who can take our already-launched, multi-agent AI platform and make it both safer and more scalable. The codebase is mature—Python first (PyTorch, TensorFlow, Hugging Face) with pockets of Node.js/TypeScript—but it now runs at a scale where reliability and controlled tool-use are just as important as raw model quality. The immediate priority is to introduce rigorous guardrails around tool invocation, prompt injection, and inter-agent communication while we refactor the distributed architecture that carries the workload. You’ll be digging into orchestration logic, reinforcement-learning-powered evaluators, and the CI/CD pipelines that push new checkpoints into production every day. Expect to pair with our ML Ops lead on model registries, versioning, and observability so the safety layer never lags behind new releases. Key deliverables • Audit the current multi-agent flow and surface safety gaps around external tool calls • Design and implement a guardrail framework (policy engine, trace logging, rollback hooks) that plugs into our existing service mesh • Re-architect the distributed runtime for higher throughput and fault tolerance, keeping latency targets intact • Hand over actionable documentation and a lightweight test harness the team can extend after you roll off Acceptance criteria 1. Guardrail coverage proves ≥ 95 % of malicious prompt cases blocked in staging tests 2. System maintains current P99 latency after migration to the new distributed layout 3. CI pipeline automatically validates safety rules before any model or agent code is promoted If you’ve shipped production agents, tuned large models, and kept them safe under real traffic, I’d love to dive deeper and map out the engagement.