Voice & Narrative
MemGhost supports voice interaction with AI agents through two local services: Kokoro for text-to-speech (TTS) and Whisper for speech-to-text (STT). Voice features are optional and run entirely on your hardware.
Voice features require AI to be enabled (AI_ENABLED=true), since they are built on top of the AI chat experience.
Enabling Voice
Add the voice profile and enable TTS/STT in your .env:
```
AI_ENABLED=true
AI_TTS_ENABLED=true
AI_STT_ENABLED=true
```
```
# Start with all profiles
docker compose --profile standalone --profile ai --profile voice up -d
```
Text-to-Speech (TTS)
TTS is powered by Kokoro, a lightweight 82M-parameter speech synthesis model that runs on CPU. It supports 67+ voices across multiple languages and genders.
How It Works
When TTS is enabled and you’re chatting with an AI agent, the response is synthesized to audio in real time. As text streams in, sentences are detected at punctuation boundaries, queued for synthesis, and played back sequentially. This produces a natural conversational flow where the agent “speaks” its response as it’s being generated.
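The streaming behavior above can be sketched as a simple sentence splitter over incoming text chunks. This is an illustrative sketch, not MemGhost's actual implementation; the punctuation heuristic and chunk interface are assumptions:

```python
import re

# Sentence-ending punctuation used as synthesis boundaries (assumed heuristic).
BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def stream_sentences(chunks):
    """Accumulate streamed text chunks and yield complete sentences
    as soon as a punctuation boundary appears."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = BOUNDARY.split(buffer)
        # Everything except the last part is a complete sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():  # flush whatever remains when the stream ends
        yield buffer.strip()

# Each yielded sentence would be queued for synthesis while the next streams in.
sentences = list(stream_sentences(["Hello the", "re! How are", " you? Fine."]))
# → ["Hello there!", "How are you?", "Fine."]
```

Queueing per sentence is what lets playback start before the full response has been generated.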
Voice Selection
Each agent can have a default voice configured. Voices follow the naming pattern {lang}{gender}_{name}:
| Category | Examples |
|---|---|
| American Female | af_heart, af_alloy, af_bella, af_nova, af_sky |
| American Male | am_echo, am_adam, am_liam, am_onyx |
| British Female | bf_emma, bf_lily, bf_alice |
| British Male | bm_george, bm_daniel |
You can change an agent’s voice in the chat settings. The full voice list is available via the API at /tts/kokoro/voices.
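The `{lang}{gender}_{name}` pattern can be decoded mechanically. The sketch below covers only the two language prefixes shown in the table (`a` for American, `b` for British); it is an illustration of the naming scheme, not an exhaustive mapping of Kokoro's voice catalog:

```python
# Illustrative decoder for voice IDs like "af_heart" or "bm_george".
LANGS = {"a": "American", "b": "British"}   # prefixes from the table above
GENDERS = {"f": "Female", "m": "Male"}

def describe_voice(voice_id: str) -> str:
    prefix, name = voice_id.split("_", 1)
    lang, gender = prefix[0], prefix[1]
    return f"{LANGS[lang]} {GENDERS[gender]}: {name}"

print(describe_voice("af_heart"))   # → American Female: heart
```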
Audio Formats
TTS supports multiple output formats:
| Format | Description |
|---|---|
| mp3 | Default, good quality and compression |
| wav | Uncompressed, lowest latency |
| opus | Excellent compression, good for streaming |
| flac | Lossless compression |
Configure the default format via AI_TTS_DEFAULT_FORMAT in your .env.
Speech-to-Text (STT)
STT is powered by Whisper.cpp, running the whisper-large-v3-turbo model for fast and accurate transcription.
How It Works
STT uses voice activity detection (VAD) running in the browser via an ONNX model. When you activate voice input:
- The browser listens for speech using the VAD model
- When speech is detected, audio is captured
- When speech ends (after a short grace period), the audio is sent to the Whisper server
- The transcription is returned and inserted into the chat input
Voice input is hands-free: the microphone waits for you to speak and automatically detects when you stop, with no push-to-talk required.
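The capture logic above can be sketched as a small state machine over VAD frames. The grace-period length and frame interface are assumed values for illustration, not MemGhost's actual parameters:

```python
IDLE, CAPTURING, GRACE = "idle", "capturing", "grace"
GRACE_FRAMES = 3  # silent frames tolerated before the utterance closes (assumed)

def run_vad(frames):
    """Consume (is_speech, sample) frames and return captured utterances.
    A short run of silence (the grace period) does not end the utterance;
    only sustained silence does, at which point the audio would be sent
    to the Whisper server for transcription."""
    state, silent, audio, utterances = IDLE, 0, [], []
    for is_speech, sample in frames:
        if is_speech:
            state, silent = CAPTURING, 0
            audio.append(sample)
        elif state in (CAPTURING, GRACE):
            state, silent = GRACE, silent + 1
            audio.append(sample)        # keep trailing audio during the grace period
            if silent >= GRACE_FRAMES:  # grace period expired: utterance complete
                utterances.append(audio)
                state, silent, audio = IDLE, 0, []
    if audio and state != IDLE:         # flush an utterance cut off mid-stream
        utterances.append(audio)
    return utterances

# A brief pause (one silent frame) does not split the utterance:
captured = run_vad([(True, 1), (True, 2), (False, 3), (True, 4),
                    (False, 5), (False, 6), (False, 7)])
# → [[1, 2, 3, 4, 5, 6, 7]]
```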
Voice Conversation Mode
When both TTS and STT are enabled, you can activate voice conversation mode from the chat header. In this mode:
- The agent speaks its response via TTS
- While the agent is speaking, the microphone is paused (to avoid picking up the TTS audio)
- After the agent finishes speaking, the microphone resumes listening
- You speak your response, which is automatically transcribed and sent
This creates a natural back-and-forth conversation without touching the keyboard.
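The turn-taking described above reduces to pausing the microphone while TTS plays and resuming it afterward. A minimal sketch, with assumed event names (the real client's events and message shape may differ):

```python
# Minimal sketch of voice-conversation turn-taking: the mic is paused while
# the agent speaks and resumes once playback finishes.
class VoiceConversation:
    def __init__(self):
        self.mic_active = True

    def on_tts_start(self):
        self.mic_active = False  # pause the mic so VAD doesn't hear the agent

    def on_tts_end(self):
        self.mic_active = True   # resume listening for the user's reply

    def on_transcription(self, text):
        # Transcribed speech becomes the next chat message.
        return {"role": "user", "content": text}
```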
Configuration Reference
TTS Variables
| Variable | Default | Description |
|---|---|---|
| AI_TTS_ENABLED | false | Enable text-to-speech |
| AI_TTS_BASE_URL | http://kokoro:8880 | Kokoro API URL |
| AI_TTS_DEFAULT_VOICE | af_heart | Default voice ID |
| AI_TTS_DEFAULT_FORMAT | mp3 | Audio output format |
| AI_TTS_DEFAULT_SPEED | 1.0 | Playback speed multiplier |
STT Variables
| Variable | Default | Description |
|---|---|---|
| AI_STT_ENABLED | false | Enable speech-to-text |
| AI_STT_BASE_URL | http://whisper:8178 | Whisper API URL |
| AI_STT_MODEL | whisper-large-v3-turbo | Whisper model |
| AI_STT_LANGUAGE | en | Default transcription language |
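Putting the reference together, a `.env` enabling both services might look like the fragment below; the values shown are the documented defaults, so only the `*_ENABLED` lines are strictly required:

```shell
AI_ENABLED=true

# Text-to-speech (Kokoro)
AI_TTS_ENABLED=true
AI_TTS_BASE_URL=http://kokoro:8880
AI_TTS_DEFAULT_VOICE=af_heart
AI_TTS_DEFAULT_FORMAT=mp3
AI_TTS_DEFAULT_SPEED=1.0

# Speech-to-text (Whisper)
AI_STT_ENABLED=true
AI_STT_BASE_URL=http://whisper:8178
AI_STT_MODEL=whisper-large-v3-turbo
AI_STT_LANGUAGE=en
```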
Hardware Requirements
| Service | CPU | RAM | Disk |
|---|---|---|---|
| Kokoro (TTS) | 1 core | ~500 MB | ~200 MB (model) |
| Whisper (STT) | 1 core | ~1 GB | ~1.5 GB (model) |
Both services run on CPU only. Synthesis latency for Kokoro is typically under 1 second per sentence. Whisper transcription takes 1-3 seconds depending on audio length.