STT, TTS, LLM Video Auto-Reply System

Budget: $250

I need a lightweight, end-to-end solution that listens to spoken Indonesian during a live video call, converts the speech to text, passes that text through a small language model to craft an automatic reply, and then speaks the response back in a natural-sounding English voice.

Scope and key points:
• Speech-to-Text: real-time transcription of incoming audio (Whisper, Google Speech, or a comparable engine).
• Language Model: compact LLM (OpenAI API, Llama.cpp, or similar) that turns the transcript into a concise reply.
• Text-to-Speech: natural voice rendering with no robotic tone, ready to play back with minimal latency.
• Pipeline: audio in ➜ STT ➜ LLM ➜ TTS ➜ audio out, all wrapped in a script or small service that I can trigger during a call.
• Compatibility: desktop environment (Windows or Linux); bonus if it can hook into common video-meeting tools via virtual audio cable or WebRTC.
• Deliverables: source code, quick-start guide, and a short demo video proving the loop works in real time.

Timing is critical: I'm aiming to see a working first cut as soon as possible, then iterate quickly if tweaks are needed. If you've built anything similar, I'd love to hear about it when you reply.
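To make the expected shape of the loop concrete, here is a minimal Python sketch of the audio in ➜ STT ➜ LLM ➜ TTS ➜ audio out chain. It is only an illustration under assumptions: the names `ReplyPipeline`, `handle_chunk`, `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical, and the three stages are stand-in stubs rather than calls to Whisper, an LLM, or a real TTS engine. A real build would swap each stub for the chosen engine and add audio capture and playback.

```python
from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ReplyPipeline:
    """Hypothetical wrapper: three pluggable stages run in sequence."""

    transcribe: Callable[[bytes], str]    # STT: Indonesian audio -> transcript
    generate_reply: Callable[[str], str]  # LLM: transcript -> short English reply
    synthesize: Callable[[str], bytes]    # TTS: reply text -> audio bytes

    def handle_chunk(self, audio: bytes) -> tuple[bytes, dict[str, float]]:
        """Run one audio chunk through the chain and time each stage."""
        timings: dict[str, float] = {}

        start = time.perf_counter()
        transcript = self.transcribe(audio)
        timings["stt"] = time.perf_counter() - start

        # Skip the LLM and TTS stages when nothing was recognised (silence).
        if not transcript.strip():
            return b"", timings

        start = time.perf_counter()
        reply = self.generate_reply(transcript)
        timings["llm"] = time.perf_counter() - start

        start = time.perf_counter()
        speech = self.synthesize(reply)
        timings["tts"] = time.perf_counter() - start

        return speech, timings


# Stand-in stages (assumptions for illustration only, not real engines).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(transcript: str) -> str:
    return f"Reply to: {transcript}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


if __name__ == "__main__":
    pipeline = ReplyPipeline(fake_stt, fake_llm, fake_tts)
    speech, timings = pipeline.handle_chunk(b"apa kabar")
    print(speech, timings)
```

Keeping the stages as plain callables lets each engine be replaced independently, and the per-stage timings show which step dominates latency during a call.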

Python
