Optimize GPU Scheduling for CNNs

Client: AI | Published: 24.11.2025

I need to squeeze every dollar of value out of my cloud GPUs while training Convolutional Neural Networks. At the moment, individual jobs over-reserve memory, idle cores pile up between epochs, and background housekeeping steals cycles I'm already paying for. The goal is clear: redesign the scheduling layer so that each GPU is kept busy with just enough resources, and nothing more, until the last gradient update finishes. Every percentage point of utilization saved translates directly into lower cloud spend, so cost efficiency is the single metric that matters.

Your task will revolve around three pillars:

1. Build or adapt a smart scheduler that can place CNN training workloads across multiple GPUs and nodes, reallocating resources on the fly when utilization drops. If you prefer to leverage existing frameworks (Kubernetes, Slurm, Ray, etc.) instead of starting from scratch, that's fine as long as the final solution plugs cleanly into my current PyTorch pipeline (see the placement sketch at the end of this brief).
2. Profile current runs, identify the exact bottlenecks (data loading stalls, communication overhead, memory fragmentation), and implement the fixes. CUDA graphs, mixed precision, gradient accumulation, or better dataloader caching: use whatever techniques actually reduce idle time (see the training-loop sketch below).
3. Prove the improvement with before-and-after benchmarks. I need clear numbers showing GPU utilization and total training cost per epoch for the same CNN model and dataset (see the measurement sketch below).

Deliverables:

• Source code or configuration files for the scheduler and any auxiliary scripts
• A reproducible benchmark notebook (or markdown report) that demonstrates the utilization gains and cost savings
• A concise setup guide so I can roll the solution into production

When you apply, include a detailed project proposal outlining the approach, toolchain, and a rough timeline. Past work summaries are optional; a clear, actionable plan is what will win the job.
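To make the first pillar concrete, here is a minimal sketch of GPU-aware job placement, assuming Ray is the scheduling substrate (the posting only lists it as one option); the `train_one_job` entry point, its config dictionary, and the fractional GPU request are illustrative assumptions, not details from my pipeline.

```python
# Hedged sketch: packing two CNN training jobs onto one GPU with Ray.
import ray
import torch

ray.init()  # connect to an existing cluster, or start a local one

@ray.remote(num_gpus=0.5)  # each job asks for half a GPU, so two can share one device
def train_one_job(config: dict) -> float:
    device = torch.device("cuda")
    # ... build the CNN, dataloaders, and optimizer from `config` here ...
    # Return a metric so the caller can decide whether to rebalance resources.
    return 0.0

# Ray co-locates both tasks on the same GPU because each requests only 0.5.
futures = [train_one_job.remote({"lr": 1e-3}), train_one_job.remote({"lr": 3e-4})]
results = ray.get(futures)
```

A real scheduler would adjust `num_gpus` (or move jobs) based on observed utilization rather than fixing it up front; the sketch only shows how placement plugs into a plain Python/PyTorch entry point.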
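For the second pillar, a minimal training-loop sketch follows, showing mixed precision plus gradient accumulation and a tuned `DataLoader`. The model, dataset, batch size, and accumulation factor are placeholders I chose for illustration, not values from the actual workload.

```python
# Hedged sketch: AMP + gradient accumulation + dataloader settings that cut idle time.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Flatten(), nn.LazyLinear(10)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()
criterion = nn.CrossEntropyLoss()

# Dummy data; pinned memory and persistent workers reduce data-loading stalls.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=64, num_workers=4,
                    pin_memory=True, persistent_workers=True)

accum_steps = 4  # effective batch of 64 * 4 without the extra activation memory
optimizer.zero_grad(set_to_none=True)
for step, (images, labels) in enumerate(loader):
    images = images.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    with torch.cuda.amp.autocast():              # forward/backward in reduced precision
        loss = criterion(model(images), labels) / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                   # unscale gradients and apply the update
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```

CUDA graphs and smarter caching would layer on top of this; which combination actually pays off is exactly what the profiling work should decide.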
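For the third pillar, here is a measurement sketch of the kind of before-and-after numbers I expect: average GPU utilization sampled via NVML and a cost-per-epoch figure derived from wall-clock time. The `measure_epoch` helper and the $2.50/hour rate are illustrative assumptions, not real pricing or an existing script.

```python
# Hedged sketch: sample GPU utilization during an epoch and convert time to cost.
import time
import threading
import pynvml  # provided by the nvidia-ml-py package

def sample_utilization(samples: list, stop: threading.Event, interval: float = 0.5) -> None:
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    while not stop.is_set():
        samples.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
        time.sleep(interval)
    pynvml.nvmlShutdown()

def measure_epoch(run_epoch, hourly_rate_usd: float = 2.50) -> dict:
    samples, stop = [], threading.Event()
    sampler = threading.Thread(target=sample_utilization, args=(samples, stop))
    sampler.start()
    start = time.time()
    run_epoch()                          # the actual training epoch goes here
    elapsed = time.time() - start
    stop.set()
    sampler.join()
    avg_util = sum(samples) / max(len(samples), 1)
    return {"epoch_seconds": elapsed,
            "avg_gpu_util_pct": avg_util,
            "cost_usd": elapsed / 3600 * hourly_rate_usd}

# Usage: print(measure_epoch(lambda: time.sleep(5)))  # replace the lambda with a real epoch
```

The benchmark deliverable should report these figures for the same model and dataset before and after the scheduling changes, so the cost savings are directly comparable.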