VOICE AI ENGINEER: Real-Time Prosody-Aware Conversational System

I need an experienced Python developer to build a production-grade voice agent that processes not just WHAT people say, but HOW they say it, using acoustic analysis to adapt conversation strategy in real time. This requires expertise in audio signal processing, emotion detection from voice, and low-latency streaming architecture. This is NOT a standard chatbot project.

TECHNICAL STACK REQUIREMENTS:

1. DUAL-TRACK AUDIO PROCESSING
• Real-time speech-to-text (Groq Whisper or equivalent high-speed STT)
• Parallel prosodic feature extraction (F0 pitch, intensity, speech rate, pause patterns)
• Emotion classification from acoustic signatures
• Context fusion layer combining transcript + voice features

2. CONVERSATIONAL STATE MANAGEMENT
• Finite state machine for multi-turn conversations
• Dynamic response selection based on conversation state + detected emotion
• Progress tracking through conversation stages
• Structured knowledge base with fast retrieval (<50ms)

3. ADAPTIVE VOICE OUTPUT
• Dynamic prosody control for TTS (pace, pitch, intensity modulation)
• SSML generation based on emotional context
• Template-based prosody patterns for different scenarios

4. HIGH-PERFORMANCE INFRASTRUCTURE
• Sub-300ms end-to-end latency target
• Groq LPU or comparable inference platform
• Real-time streaming pipeline (Pipecat or similar framework)
• Telephony integration (Twilio or similar)

5. STRUCTURED KNOWLEDGE SYSTEM
• JSON-based response database
• Emotion-conditioned content variants
• Metadata for prosody instructions
• In-memory architecture for speed
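To make the knowledge system concrete, here is a minimal sketch of the shape we have in mind: response variants keyed by detected emotion, with prosody hints attached for the TTS layer, held entirely in memory. All node, field, and emotion names below are illustrative examples, not a fixed schema.

```python
import json
import time

# Illustrative schema: each response node keys content variants by detected
# emotion and carries prosody metadata for the TTS layer (names are examples).
KNOWLEDGE_BASE_JSON = """
{
  "greeting": {
    "variants": {
      "neutral":    {"text": "Hi, thanks for taking my call.",
                     "prosody": {"rate": "medium", "pitch": "medium"}},
      "frustrated": {"text": "I hear you, let me keep this quick.",
                     "prosody": {"rate": "fast", "pitch": "low"}},
      "happy":      {"text": "Great to catch you! This will only take a minute.",
                     "prosody": {"rate": "medium", "pitch": "high"}}
    }
  }
}
"""

class KnowledgeBase:
    """Loads all responses into memory once so lookups stay well under 50ms."""

    def __init__(self, raw_json: str):
        self._nodes = json.loads(raw_json)

    def select(self, node_id: str, emotion: str) -> dict:
        variants = self._nodes[node_id]["variants"]
        # Fall back to the neutral variant when the detected emotion has
        # no dedicated content.
        return variants.get(emotion, variants["neutral"])

kb = KnowledgeBase(KNOWLEDGE_BASE_JSON)
start = time.perf_counter()
response = kb.select("greeting", "frustrated")
elapsed_ms = (time.perf_counter() - start) * 1000
print(response["text"])   # -> "I hear you, let me keep this quick."
print(elapsed_ms < 50)    # in-memory dict lookup is microseconds, not ms
```

The point of the sketch is the retrieval path: no database round trip sits between emotion detection and response selection, which is what makes the <50ms budget realistic.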
6. INTEGRATION & LOGGING
• CRM webhook integration (HubSpot or similar)
• Conversation transcripts with emotional state tracking
• Audio recording storage
• Performance metrics logging

REQUIRED TECHNICAL EXPERIENCE:
✅ Real-time audio signal processing
✅ Acoustic feature extraction and prosody analysis
✅ Emotion/sentiment detection from voice (not just text)
✅ Low-latency system design (<500ms response requirements)
✅ State machine implementation for goal-oriented conversations
✅ Streaming audio pipelines

STRONGLY PREFERRED:
- High-speed inference platforms
- Real-time voice frameworks
- Production voice AI deployments

DELIVERABLES:

PHASE 1 - CORE PIPELINE (Weeks 1-3)
□ Parallel processing: STT + prosody extraction running simultaneously
□ Emotion classifier (4+ emotion categories from voice features)
□ Basic state machine with 5+ conversation states
□ TTS with dynamic prosody control (3+ different voice styles)

PHASE 2 - ORCHESTRATION (Weeks 4-5)
□ Conversational orchestrator managing state transitions
□ Knowledge base structure with emotion-conditioned responses
□ Response selection logic based on state + emotion
□ Conversation flow testing framework

PHASE 3 - INTEGRATION (Weeks 6-7)
□ Twilio telephony integration (outbound calling)
□ CRM webhook implementation
□ Audio recording and transcript storage
□ Performance monitoring dashboard

PHASE 4 - OPTIMIZATION (Week 8)
□ Latency optimization (<300ms target)
□ Error handling and fallback logic
□ Deployment documentation
□ Testing suite

ACCEPTANCE CRITERIA:
✓ System detects distinct emotional states from voice with >70% accuracy
✓ Agent adapts its speaking style (pace/tone) based on detected emotion
✓ Conversation progresses through defined states based on user responses
✓ End-to-end latency <500ms (speech input → speech output)
✓ Calls are logged with transcript + emotional state progression
✓ Modular codebase with clear separation of concerns
✓ Runs on cloud infrastructure with Docker deployment

TO APPLY, PLEASE PROVIDE:
1. Links showing relevant voice AI or audio processing work
2. A brief technical explanation: "How would you extract emotional state from a 3-second audio clip in real time?"
3. Your availability and estimated timeline

NOTE: The actual conversation logic and content will be provided by us. You're building the technical infrastructure, not designing the conversational strategy.

Budget: subject to quotation - hourly rate, fixed price, or hybrid (depending on experience and timeline)
Timeline: 6-8 weeks for the complete system
Language: English
Location: Remote (must overlap 4+ hours with Melbourne, Australia)
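For context on the prosody track described under DUAL-TRACK AUDIO PROCESSING, here is a rough numpy-only sketch of pulling coarse features (F0 via autocorrelation, RMS intensity, pause ratio) from a short clip. It is illustrative only: the framing sizes and silence threshold are assumptions, and a production build would use a robust pitch tracker such as pYIN rather than raw autocorrelation.

```python
import numpy as np

def prosodic_features(audio: np.ndarray, sr: int = 16000) -> dict:
    """Extract coarse prosodic features from a short mono clip.

    Sketch only: thresholds are illustrative, and real systems would use
    a tuned pitch tracker (e.g. pYIN) with per-frame voicing decisions.
    """
    frame_len = int(0.03 * sr)   # 30 ms frames
    hop = frame_len // 2         # 50% overlap
    frames = [audio[i:i + frame_len]
              for i in range(0, len(audio) - frame_len, hop)]

    # Per-frame RMS energy; frames below a fraction of the peak count as pauses.
    rms = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    silence_thresh = 0.1 * rms.max()   # illustrative threshold
    pause_ratio = float(np.mean(rms < silence_thresh))

    def frame_f0(frame):
        # Autocorrelation pitch estimate, searched over 75-400 Hz.
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = sr // 400, sr // 75
        lag = lo + int(np.argmax(ac[lo:hi]))
        return sr / lag

    voiced = [f for f, e in zip(frames, rms) if e >= silence_thresh]
    f0s = np.array([frame_f0(f) for f in voiced]) if voiced else np.array([0.0])

    return {
        "f0_mean_hz": float(f0s.mean()),
        "f0_std_hz": float(f0s.std()),
        "intensity_rms": float(rms.mean()),
        "pause_ratio": pause_ratio,
    }

# Smoke test: a synthetic 3-second 200 Hz tone with a half-second silent gap.
sr = 16000
t = np.linspace(0, 3, 3 * sr, endpoint=False)
clip = 0.5 * np.sin(2 * np.pi * 200 * t)
clip[sr:sr + sr // 2] = 0.0
feats = prosodic_features(clip, sr)
print(feats)
```

These features would feed the emotion classifier in parallel with the STT transcript; on a 3-second clip this style of framing-plus-statistics runs comfortably within a real-time budget.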