Build a Large-Scale Red-Teaming Harness

Client: AI | Posted: 04.10.2025
Budget: $25

I’m standing up a new platform that converts raw production traces into fully runnable sandboxes so enterprises can probe and harden their AI agents under real-world pressure. The first puzzle piece I need is the large-scale simulation and red-teaming harness.

Here’s the job in plain terms:

• Take streams of user activity logs (our chosen ground truth) and drive realistic, concurrent simulations that an agent must survive.
• Orchestrate these simulations at scale (think thousands of parallel runs) with deterministic replay, fault injection, and easy scenario authoring.
• Expose rich hooks for automatic scoring, guardrail enforcement, and CI-style pass/fail gates so agents can be promoted or rolled back based on evidence.
• Package the whole harness so it plugs cleanly into the broader platform once the environment-synthesis engine and privacy-safe ingestion pipelines come online.

Tech choices are flexible, but you’ll likely lean on container orchestration (Kubernetes or similar), a high-throughput message bus, and a language comfortable with async IO (Go, Rust, or Python with asyncio are all fine). Robust observability, testability, and multi-tenant safety are must-haves from day one.

Deliverables (acceptance criteria):

1. Codebase and IaC scripts that spin up the harness on a vanilla cloud account.
2. Simulation engine that replays sanitized user activity logs with configurable tempo and randomness (see the replay sketch below).
3. Metric emitter producing per-run dashboards and an API endpoint the evaluation loop can poll (see the endpoint sketch below).
4. Documentation and sample scenarios that prove an AI agent can be red-teamed automatically, complete with pass/fail output (see the gate sketch below).

If you love building reliable back-end systems that break things on purpose so everyone learns faster, let’s talk.
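To make deliverable 2 concrete, here’s a minimal sketch of a replay loop in Python with asyncio (one of the languages suggested above). The LogEvent schema, the agent.handle interface, and the parameter names are all illustrative assumptions, not part of the spec.

```python
import asyncio
import random
from dataclasses import dataclass

@dataclass
class LogEvent:
    """One sanitized user-activity record (hypothetical schema)."""
    offset_s: float  # seconds since the start of the original trace
    payload: dict

async def replay(events: list[LogEvent], agent, tempo: float = 1.0,
                 jitter_s: float = 0.0, seed: int = 0) -> None:
    """Replay a trace against the agent under test.

    tempo > 1 compresses wall-clock time; seeding the jitter RNG makes
    each run's schedule reproducible from (trace, seed) alone.
    """
    rng = random.Random(seed)
    clock = 0.0  # current position on this run's compressed timeline
    for event in sorted(events, key=lambda e: e.offset_s):
        target = event.offset_s / tempo + rng.uniform(0.0, jitter_s)
        await asyncio.sleep(max(0.0, target - clock))
        clock = max(clock, target)
        await agent.handle(event.payload)  # hypothetical agent interface
```

Seeding the jitter RNG per run is what keeps “randomness” compatible with “deterministic replay”: the same (trace, seed) pair reproduces the same schedule, so a failing run can be rerun exactly.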
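For deliverable 3, a sketch of the pollable metrics endpoint, assuming aiohttp (any async HTTP framework would do). The in-memory METRICS dict stands in for whatever store the real emitter writes per-run snapshots to, and the route shape is an assumption.

```python
from aiohttp import web

METRICS: dict[str, dict] = {}  # run_id -> latest metric snapshot (stand-in store)

async def get_metrics(request: web.Request) -> web.Response:
    """Return the latest snapshot for one run; the evaluation loop polls this."""
    run_id = request.match_info["run_id"]
    if run_id not in METRICS:
        raise web.HTTPNotFound(text=f"unknown run {run_id}")
    return web.json_response(METRICS[run_id])

app = web.Application()
app.add_routes([web.get("/runs/{run_id}/metrics", get_metrics)])

if __name__ == "__main__":
    METRICS["demo"] = {"events_replayed": 1234, "guardrail_violations": 0}
    web.run_app(app, port=8080)  # GET /runs/demo/metrics returns the snapshot
```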
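And for the pass/fail output in deliverable 4, a sketch of a CI-style gate. The RunResult fields and the thresholds are placeholders; the real scoring hooks would populate and tune them.

```python
import sys
from dataclasses import dataclass

@dataclass
class RunResult:
    """Aggregated outcome of one simulation run (illustrative fields)."""
    run_id: str
    guardrail_violations: int = 0
    task_success_rate: float = 0.0

def ci_gate(results: list[RunResult], min_success: float = 0.95) -> bool:
    """Fail the batch on any guardrail violation or missed success threshold."""
    for r in results:
        if r.guardrail_violations > 0:
            print(f"FAIL {r.run_id}: guardrail violated")
            return False
        if r.task_success_rate < min_success:
            print(f"FAIL {r.run_id}: success rate {r.task_success_rate:.2%} below target")
            return False
    print(f"PASS: {len(results)} run(s) within thresholds")
    return True

if __name__ == "__main__":
    # The process exit code is what lets a CI pipeline promote or roll back.
    demo = [RunResult("demo", task_success_rate=0.99)]
    sys.exit(0 if ci_gate(demo) else 1)
```

Mapping the gate to a process exit code is what turns “promoted or rolled back based on evidence” into a one-line CI step.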