
Voice & Narrative

MemGhost supports voice interaction with AI agents through two local services: Kokoro for text-to-speech (TTS) and Whisper for speech-to-text (STT). Voice features are optional and run entirely on your hardware.

Voice features require AI to be enabled (AI_ENABLED=true) since they enhance the AI chat experience.

Enabling Voice

Enable TTS/STT in your .env:

.env

```env
AI_ENABLED=true
AI_TTS_ENABLED=true
AI_STT_ENABLED=true
```

Then start the stack with the voice profile alongside your existing profiles:

```sh
# Start with all profiles
docker compose --profile standalone --profile ai --profile voice up -d
```

Text-to-Speech (TTS)

TTS is powered by Kokoro, a lightweight 82M-parameter speech synthesis model that runs on CPU. It supports 67+ voices across multiple languages and genders.

How It Works

When TTS is enabled and you’re chatting with an AI agent, the response is synthesized to audio in real time. As text streams in, sentences are detected at punctuation boundaries, queued for synthesis, and played back sequentially. This produces a natural conversational flow where the agent “speaks” its response as it’s being generated.
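As a rough sketch, punctuation-boundary chunking for streaming synthesis could look like the following (the regex and the `enqueueForSynthesis` callback are illustrative, not MemGhost's actual code):

```ts
// Illustrative sketch of sentence detection over a streaming text response.
// Complete sentences are emitted for synthesis; the unfinished tail stays
// buffered until more text arrives.
const SENTENCE_END = /([.!?…]+)(\s|$)/;

let buffer = "";

function onTextChunk(chunk: string, enqueueForSynthesis: (s: string) => void) {
  buffer += chunk;
  let match: RegExpExecArray | null;
  while ((match = SENTENCE_END.exec(buffer)) !== null) {
    const end = match.index + match[1].length;
    enqueueForSynthesis(buffer.slice(0, end).trim());
    buffer = buffer.slice(end);
  }
}
```

Each enqueued sentence is synthesized and played in order, which is what lets playback begin before the model has finished generating the full response.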

Voice Selection

Each agent can have a default voice configured. Voices follow the naming pattern {lang}{gender}_{name}:

| Category | Examples |
| --- | --- |
| American Female | af_heart, af_alloy, af_bella, af_nova, af_sky |
| American Male | am_echo, am_adam, am_liam, am_onyx |
| British Female | bf_emma, bf_lily, bf_alice |
| British Male | bm_george, bm_daniel |

You can change an agent’s voice in the chat settings. The full voice list is available via the API at /tts/kokoro/voices.
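For example, a client could fetch and filter that list (the string-array response shape shown here is an assumption for illustration):

```ts
// Fetch the available Kokoro voices from the MemGhost API.
const res = await fetch("/tts/kokoro/voices");
const voices: string[] = await res.json();

// Filter by the {lang}{gender}_{name} prefix, e.g. British Female voices.
console.log(voices.filter((v) => v.startsWith("bf_")));
```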

Audio Formats

TTS supports multiple output formats:

| Format | Description |
| --- | --- |
| mp3 | Default; good quality and compression |
| wav | Uncompressed; lowest latency |
| opus | Excellent compression; good for streaming |
| flac | Lossless compression |

Configure the default format via AI_TTS_DEFAULT_FORMAT in your .env.
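If you want to call the TTS service directly, Kokoro's bundled server speaks an OpenAI-compatible audio API; the sketch below assumes its /v1/audio/speech endpoint and request shape, which is an assumption about your deployment rather than MemGhost's documented surface:

```ts
// Hypothetical direct request to the Kokoro service. MemGhost normally
// proxies TTS for you; this just shows per-request format selection.
const res = await fetch("http://kokoro:8880/v1/audio/speech", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "kokoro",
    voice: "af_heart",
    input: "Hello from MemGhost.",
    response_format: "opus", // overrides AI_TTS_DEFAULT_FORMAT for this call
  }),
});
const audio = await res.arrayBuffer();
```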

Speech-to-Text (STT)

STT is powered by Whisper.cpp, running the whisper-large-v3-turbo model for fast and accurate transcription.

How It Works

STT uses voice activity detection (VAD) running in the browser via an ONNX model. When you activate voice input:

  1. The browser listens for speech using the VAD model
  2. When speech is detected, audio is captured
  3. When speech ends (after a short grace period), the audio is sent to the Whisper server
  4. The transcription is returned and inserted into the chat input

Voice input is hands-free: the microphone waits for you to speak and automatically detects when you stop, rather than requiring push-to-talk. A browser-side sketch of this capture flow follows below.
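The sketch uses MicVAD from the @ricky0123/vad-web package (Silero VAD via ONNX Runtime Web), which is one common way to run VAD in the browser; MemGhost's actual wiring may differ. The /inference endpoint is whisper.cpp's server default and is likewise an assumption here:

```ts
import { MicVAD, utils } from "@ricky0123/vad-web";

declare function insertIntoChatInput(text: string): void; // hypothetical UI hook

const vad = await MicVAD.new({
  onSpeechEnd: async (audio: Float32Array) => {
    // Encode the captured samples as WAV and send them for transcription.
    const wav = utils.encodeWAV(audio);
    const form = new FormData();
    form.append("file", new Blob([wav], { type: "audio/wav" }), "speech.wav");
    const res = await fetch("http://whisper:8178/inference", {
      method: "POST",
      body: form,
    });
    const { text } = await res.json();
    insertIntoChatInput(text);
  },
});
vad.start(); // begin listening for speech
```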

Voice Conversation Mode

When both TTS and STT are enabled, you can activate voice conversation mode from the chat header. In this mode:

  1. The agent speaks its response via TTS
  2. While the agent is speaking, the microphone is paused (to avoid picking up the TTS audio)
  3. After the agent finishes speaking, the microphone resumes listening
  4. You speak your response, which is automatically transcribed and sent

This creates a natural back-and-forth conversation without touching the keyboard.
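A compact way to picture the turn-taking is a pause/resume wrapper around playback (the names here are hypothetical stand-ins for the real player and VAD controller):

```ts
// Illustrative turn-taking for voice conversation mode: pause the mic while
// the agent's TTS audio plays so VAD doesn't trigger on it, then resume.
async function speakThenListen(
  vad: { start(): void; pause(): void },
  playAgentAudio: () => Promise<void>,
) {
  vad.pause();            // steps 1-2: agent speaks, mic paused
  await playAgentAudio();
  vad.start();            // steps 3-4: resume listening for the user's reply
}
```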

Configuration Reference

TTS Variables

| Variable | Default | Description |
| --- | --- | --- |
| AI_TTS_ENABLED | false | Enable text-to-speech |
| AI_TTS_BASE_URL | http://kokoro:8880 | Kokoro API URL |
| AI_TTS_DEFAULT_VOICE | af_heart | Default voice ID |
| AI_TTS_DEFAULT_FORMAT | mp3 | Audio output format |
| AI_TTS_DEFAULT_SPEED | 1.0 | Playback speed multiplier |

STT Variables

| Variable | Default | Description |
| --- | --- | --- |
| AI_STT_ENABLED | false | Enable speech-to-text |
| AI_STT_BASE_URL | http://whisper:8178 | Whisper API URL |
| AI_STT_MODEL | whisper-large-v3-turbo | Whisper model |
| AI_STT_LANGUAGE | en | Default transcription language |
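Putting the two tables together, a fully spelled-out voice section of .env looks like this (all values are the defaults above, with the enable flags switched on):

```env
AI_TTS_ENABLED=true
AI_TTS_BASE_URL=http://kokoro:8880
AI_TTS_DEFAULT_VOICE=af_heart
AI_TTS_DEFAULT_FORMAT=mp3
AI_TTS_DEFAULT_SPEED=1.0

AI_STT_ENABLED=true
AI_STT_BASE_URL=http://whisper:8178
AI_STT_MODEL=whisper-large-v3-turbo
AI_STT_LANGUAGE=en
```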

Hardware Requirements

| Service | CPU | RAM | Disk |
| --- | --- | --- | --- |
| Kokoro (TTS) | 1 core | ~500 MB | ~200 MB (model) |
| Whisper (STT) | 1 core | ~1 GB | ~1.5 GB (model) |

Both services run on CPU only. Synthesis latency for Kokoro is typically under 1 second per sentence. Whisper transcription takes 1-3 seconds depending on audio length.