Low-Latency Cross-Platform Transcription

Client: AI | Published: 27.09.2025

I’m building a live speech-to-text tool that listens to a dynamic-mic input and transcribes it in real time. The application has to run on Windows, macOS, and Linux without relying on a GPU. Staying strictly CPU-only, it should keep total RAM usage below 500 MB while still reaching close to 95–100 % word-level accuracy. Latency needs to be low enough for smooth on-screen captioning: fractions of a second, not seconds. I’m open to whichever stack makes that possible (Whisper.cpp, Vosk, Kaldi, on-device quantised models, or a custom approach) as long as it is optimised for speed and memory.

The finished result should include:

• A runnable binary or easily reproduced build for the three desktop platforms
• Source code with clear comments and a concise README covering setup, model download, and typical CPU usage numbers
• A minimal CLI (or small GUI if you prefer) that starts listening, prints live transcripts, and exposes a latency/readability toggle
• A short verification log demonstrating the memory footprint, latency, and accuracy on a sample recording from a dynamic mic

I’m not requiring any particular programming language; in fact, C++, Rust, and the like would be preferable to Python and other heavy languages. If you have prior experience squeezing ASR models into tight footprints while keeping accuracy high, let’s talk.
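To illustrate what I mean by the latency/readability toggle, here is a minimal sketch of the flush policy I have in mind. All names here (`Mode`, `Hypothesis`, `caption_update`) are hypothetical, not tied to any particular ASR library: in low-latency mode every partial hypothesis is pushed to the screen immediately, while in readability mode output is held until the recognizer marks the utterance as final, so the caption text does not flicker.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toggle: trade update speed against stable, readable captions.
enum class Mode { LowLatency, Readability };

// One hypothesis from the recognizer; is_final means an endpoint was reached.
struct Hypothesis {
    std::string text;
    bool is_final;
};

// Decide which caption lines to emit for a single incoming hypothesis.
// LowLatency: show every partial result as it arrives.
// Readability: show only finalized utterances.
std::vector<std::string> caption_update(Mode mode, const Hypothesis& h) {
    std::vector<std::string> out;
    if (mode == Mode::LowLatency || h.is_final) {
        out.push_back(h.text);
    }
    return out;
}
```

However the bid is structured, the CLI flag could simply switch between these two policies at startup.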