Self-Hosted AI Video & Voice without Latency!

Client: AI | Published: 17.02.2026
Budget: $750

I’m putting together a fully local pipeline that can turn text or microphone input into real-time synthetic video with matching speech, without perceptible latency and without ever calling a paid API. What I need from you is a clear, reproducible worksheet that walks me from a blank machine to a working demo.

My main pain point is model selection and setup, so the document has to name the exact models, versions, weights and repos you recommend (Stable Diffusion / Stable Video Diffusion or similar on the visual side, plus an open-source TTS or voice-cloning stack). Everything must run on a local Linux server with an NVIDIA GPU, CUDA and Python, with no external SaaS calls. Please include any build flags, environment variables, VRAM tips and latency-saving tricks that actually matter in practice, and show how to wire the video and audio streams together so that a single text prompt (or incoming audio) yields a synchronised clip in real time. If Docker or ComfyUI speeds things up, note that too, but keep the whole process self-contained.

Deliverable
• A step-by-step worksheet (Markdown or PDF) that installs, configures and launches the chosen models locally, finishing with a simple command or script that proves everything works in real time.

I’ll judge the job complete when I can follow your worksheet on fresh hardware and generate a short talking-head video with coherent audio on my own box: no cloud calls, no licensing surprises.
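To give a sense of the final wiring step I expect the worksheet to cover, here is a minimal sketch of how the audio and video streams might be synchronised. All specifics here are assumptions, not requirements: the file names, the 16 fps rate (typical for SVD-style models), and the choice of `ffmpeg` for muxing. The idea is simply to derive the frame count from the TTS output’s duration, then combine the two streams:

```python
import math


def frames_for_audio(audio_seconds: float, fps: int) -> int:
    """Number of video frames needed to cover the audio at the given frame rate."""
    return math.ceil(audio_seconds * fps)


def mux_command(video_path: str, audio_path: str, out_path: str) -> list[str]:
    """Build an ffmpeg command that merges the streams without re-encoding video.

    ``-c:v copy`` keeps the generated frames untouched; ``-shortest`` trims
    whichever stream runs longer so audio and video end together.
    """
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", audio_path,
        "-c:v", "copy",
        "-c:a", "aac",
        "-shortest",
        out_path,
    ]


if __name__ == "__main__":
    # e.g. 3.2 s of synthesised speech at 16 fps needs 52 frames
    print(frames_for_audio(3.2, 16))
    # print the mux command instead of running it, in case ffmpeg is absent
    print(" ".join(mux_command("talking_head.mp4", "speech.wav", "demo.mp4")))
```

A real-time version would stream chunks rather than mux whole files, but the same duration-to-frame-count bookkeeping applies per chunk.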