Voice & Narrative
MemGhost supports voice interaction with AI agents through two local services: Kokoro for text-to-speech (TTS) and Whisper for speech-to-text (STT). Voice features are optional and run entirely on your hardware.
Voice features require AI to be enabled (AI_ENABLED=true), since they are built on top of the AI chat experience.
Enabling Voice
Add the voice profile and enable TTS/STT in your .env:
```
AI_ENABLED=true
AI_TTS_ENABLED=true
AI_STT_ENABLED=true
```
```
# Start with all profiles
docker compose --profile standalone --profile ai --profile voice up -d
```
Text-to-Speech (TTS)
TTS is powered by Kokoro, a lightweight 82M-parameter speech synthesis model that runs on CPU. It supports 67+ voices across multiple languages and genders.
How It Works
When TTS is enabled and you’re chatting with an AI agent, the response is synthesized to audio in real time. As text streams in, sentences are detected at punctuation boundaries, queued for synthesis, and played back sequentially. This produces a natural conversational flow where the agent “speaks” its response as it’s being generated.
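The streaming behavior above can be sketched as a simple sentence splitter over incoming text chunks. This is an illustrative sketch, not MemGhost's actual implementation; the punctuation heuristic and chunk interface are assumptions:

```python
import re

# Sentence-ending punctuation used as synthesis boundaries (assumed heuristic).
BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def stream_sentences(chunks):
    """Accumulate streamed text chunks and yield complete sentences
    as soon as a punctuation boundary appears."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = BOUNDARY.split(buffer)
        # Everything except the last part is a complete sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():  # flush whatever remains when the stream ends
        yield buffer.strip()

# Each yielded sentence would be queued for synthesis while the next streams in.
sentences = list(stream_sentences(["Hello the", "re! How are", " you? Fine."]))
# → ["Hello there!", "How are you?", "Fine."]
```

Queueing per sentence is what lets playback start before the full response has been generated.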
Voice Selection
Each agent can have a default voice configured. Voices follow the naming pattern {lang}{gender}_{name}:
| Category | Examples |
|---|---|
| American Female | af_heart, af_alloy, af_bella, af_nova, af_sky |
| American Male | am_echo, am_adam, am_liam, am_onyx |
| British Female | bf_emma, bf_lily, bf_alice |
| British Male | bm_george, bm_daniel |
You can change an agent’s voice in the chat settings. The full voice list is available via the API at /tts/kokoro/voices.
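The `{lang}{gender}_{name}` pattern can be decoded mechanically. The sketch below covers only the two language prefixes shown in the table (`a` for American, `b` for British); it is an illustration of the naming scheme, not an exhaustive mapping of Kokoro's voice catalog:

```python
# Illustrative decoder for voice IDs like "af_heart" or "bm_george".
LANGS = {"a": "American", "b": "British"}   # prefixes from the table above
GENDERS = {"f": "Female", "m": "Male"}

def describe_voice(voice_id: str) -> str:
    prefix, name = voice_id.split("_", 1)
    lang, gender = prefix[0], prefix[1]
    return f"{LANGS[lang]} {GENDERS[gender]}: {name}"

print(describe_voice("af_heart"))   # → American Female: heart
```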
Audio Formats
TTS supports multiple output formats:
| Format | Description |
|---|---|
| mp3 | Default, good quality and compression |
| wav | Uncompressed, lowest latency |
| opus | Excellent compression, good for streaming |
| flac | Lossless compression |
Configure the default format via AI_TTS_DEFAULT_FORMAT in your .env.
Speech-to-Text (STT)
STT is powered by Whisper.cpp, running the whisper-large-v3-turbo model for fast and accurate transcription.
How It Works
STT uses voice activity detection (VAD) running in the browser via an ONNX model. When you activate voice input:
- The browser listens for speech using the VAD model
- When speech is detected, audio is captured
- When speech ends (after a short grace period), the audio is sent to the Whisper server
- The transcription is returned and inserted into the chat input
Voice input is hands-free: the microphone waits for you to speak and automatically detects when you stop, with no push-to-talk required.
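The capture logic above can be sketched as a small state machine over VAD frames. The grace-period length and frame interface are assumed values for illustration, not MemGhost's actual parameters:

```python
IDLE, CAPTURING, GRACE = "idle", "capturing", "grace"
GRACE_FRAMES = 3  # silent frames tolerated before the utterance closes (assumed)

def run_vad(frames):
    """Consume (is_speech, sample) frames and return captured utterances.
    A short run of silence (the grace period) does not end the utterance;
    only sustained silence does, at which point the audio would be sent
    to the Whisper server for transcription."""
    state, silent, audio, utterances = IDLE, 0, [], []
    for is_speech, sample in frames:
        if is_speech:
            state, silent = CAPTURING, 0
            audio.append(sample)
        elif state in (CAPTURING, GRACE):
            state, silent = GRACE, silent + 1
            audio.append(sample)        # keep trailing audio during the grace period
            if silent >= GRACE_FRAMES:  # grace period expired: utterance complete
                utterances.append(audio)
                state, silent, audio = IDLE, 0, []
    if audio and state != IDLE:         # flush an utterance cut off mid-stream
        utterances.append(audio)
    return utterances

# A brief pause (one silent frame) does not split the utterance:
captured = run_vad([(True, 1), (True, 2), (False, 3), (True, 4),
                    (False, 5), (False, 6), (False, 7)])
# → [[1, 2, 3, 4, 5, 6, 7]]
```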
Voice Conversation Mode
When both TTS and STT are enabled, you can activate voice conversation mode from the chat header. In this mode:
- The agent speaks its response via TTS
- While the agent is speaking, the microphone is paused (to avoid picking up the TTS audio)
- After the agent finishes speaking, the microphone resumes listening
- You speak your response, which is automatically transcribed and sent
This creates a natural back-and-forth conversation without touching the keyboard.
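The turn-taking described above reduces to pausing the microphone while TTS plays and resuming it afterward. A minimal sketch, with assumed event names (the real client's events and message shape may differ):

```python
# Minimal sketch of voice-conversation turn-taking: the mic is paused while
# the agent speaks and resumes once playback finishes.
class VoiceConversation:
    def __init__(self):
        self.mic_active = True

    def on_tts_start(self):
        self.mic_active = False  # pause the mic so VAD doesn't hear the agent

    def on_tts_end(self):
        self.mic_active = True   # resume listening for the user's reply

    def on_transcription(self, text):
        # Transcribed speech becomes the next chat message.
        return {"role": "user", "content": text}
```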
Configuration Reference
TTS Variables
| Variable | Default | Description |
|---|---|---|
| AI_TTS_ENABLED | false | Enable text-to-speech |
| AI_TTS_BASE_URL | http://kokoro:8880 | Kokoro API URL |
| AI_TTS_DEFAULT_VOICE | af_heart | Default voice ID |
| AI_TTS_DEFAULT_FORMAT | mp3 | Audio output format |
| AI_TTS_DEFAULT_SPEED | 1.0 | Playback speed multiplier |
STT Variables
| Variable | Default | Description |
|---|---|---|
| AI_STT_ENABLED | false | Enable speech-to-text |
| AI_STT_BASE_URL | http://whisper:8178 | Whisper API URL |
| AI_STT_MODEL | whisper-large-v3-turbo | Whisper model |
| AI_STT_LANGUAGE | en | Default transcription language |
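Putting the reference together, a `.env` enabling both services might look like the fragment below; the values shown are the documented defaults, so only the `*_ENABLED` lines are strictly required:

```shell
AI_ENABLED=true

# Text-to-speech (Kokoro)
AI_TTS_ENABLED=true
AI_TTS_BASE_URL=http://kokoro:8880
AI_TTS_DEFAULT_VOICE=af_heart
AI_TTS_DEFAULT_FORMAT=mp3
AI_TTS_DEFAULT_SPEED=1.0

# Speech-to-text (Whisper)
AI_STT_ENABLED=true
AI_STT_BASE_URL=http://whisper:8178
AI_STT_MODEL=whisper-large-v3-turbo
AI_STT_LANGUAGE=en
```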
Hardware Requirements
| Service | CPU | RAM | Disk |
|---|---|---|---|
| Kokoro (TTS) | 1 core | ~500 MB | ~200 MB (model) |
| Whisper (STT) | 1 core | ~1 GB | ~1.5 GB (model) |
Both services run on CPU only. Synthesis latency for Kokoro is typically under 1 second per sentence. Whisper transcription takes 1-3 seconds depending on audio length.