Files

Cameron d23f0f9328 fix: harden voice memo delivery diagnostics (#536 )

Co-authored-by: Letta Code <noreply@letta.com>

2026-03-09 13:22:26 -07:00

7.3 KiB

Raw Blame History

Voice

LettaBot has full voice support: it can receive voice messages (transcribed to text) and reply with voice memos (generated via TTS). Both features work across Telegram, WhatsApp, Signal, Discord, and Slack.

Voice Transcription (Receiving Voice Messages)

When a user sends a voice message, LettaBot downloads the audio, transcribes it via the configured provider, and delivers the text to the agent with a [Voice message]: prefix.

Providers

OpenAI Whisper (default):

transcription:
  provider: openai
  apiKey: sk-...       # Optional: falls back to OPENAI_API_KEY env var
  model: whisper-1     # Default

Mistral Voxtral (faster, lower cost):

transcription:
  provider: mistral
  apiKey: ...          # Optional: falls back to MISTRAL_API_KEY env var
  model: voxtral-mini-latest  # Default

Or configure via environment variables alone:

# OpenAI (default provider when no config is set)
export OPENAI_API_KEY=sk-...

# Mistral (requires provider to be set in config)
export MISTRAL_API_KEY=...

If no API key is configured, users who send voice messages will receive an error message with a setup link.

Supported Audio Formats

These formats are sent directly to the transcription API (some with a filename remap):

flac, m4a, mp3, mp4, mpeg, mpga, oga, ogg, opus, wav, webm

These formats are automatically converted to MP3 via ffmpeg (if installed):

aac, amr, caf, 3gp, 3gpp

Files over 20MB are automatically split into 10-minute chunks before transcription.

Channel Support

Channel	Format received	Notes
Telegram	OGG/Opus	Native voice messages
WhatsApp	OGG/Opus	Push-to-talk voice messages
Signal	Various	Voice attachments
Discord	Various	Audio file attachments
Slack	Various	Audio file uploads

Voice Memos (Sending Voice Notes)

The agent can reply with voice notes using the <voice> directive. The text is sent to a TTS provider, converted to OGG Opus audio, and delivered as a native voice bubble (on Telegram and WhatsApp) or a playable audio attachment (on Discord and Slack).

How It Works

The agent includes a <voice> tag in its response:

<actions>
  <voice>Hey, here's a quick update on that thing we discussed.</voice>
</actions>

This can be combined with text -- anything after the </actions> block is sent as a normal message alongside the voice note:

<actions>
  <voice>Here's the summary as audio.</voice>
</actions>
And here it is in text form too!

See directives.md for the full directive reference.

Providers

ElevenLabs (default):

tts:
  provider: elevenlabs
  apiKey: sk_475a...                    # Or ELEVENLABS_API_KEY env var
  voiceId: onwK4e9ZLuTAKqWW03F9         # Or ELEVENLABS_VOICE_ID env var
  model: eleven_multilingual_v2         # Or ELEVENLABS_MODEL_ID

Browse voices at elevenlabs.io/voice-library.

OpenAI:

tts:
  provider: openai
  apiKey: sk-...                        # Or OPENAI_API_KEY env var
  voiceId: alloy                        # Or OPENAI_TTS_VOICE (options: alloy, echo, fable, onyx, nova, shimmer)
  model: tts-1                          # Or OPENAI_TTS_MODEL (use tts-1-hd for higher quality)

Channel Support

Channel	Delivery	Notes
Telegram	Native voice bubble	Falls back to audio file if user has voice message privacy enabled (Telegram Premium). Users can allow via Settings > Privacy and Security > Voice Messages.
WhatsApp	Native voice bubble	Sent with push-to-talk (`ptt: true`) for native rendering.
Discord	Audio attachment	Playable inline.
Slack	Audio attachment	Playable inline.
Signal	Audio attachment	Sent as a file attachment.

When to Use Voice

User sent a voice message and a voice reply feels natural
User explicitly asks for a voice/audio response
Short, conversational responses (under ~30 seconds of speech)

When NOT to Use Voice

Code snippets, file paths, URLs, or structured data -- these should be text
Long responses (keep voice under ~30 seconds)
When the user has indicated a preference for text

CLI Tools

`lettabot-tts`

Generate audio from the command line:

lettabot-tts "Hello, this is a test"           # Outputs file path to stdout
lettabot-tts "Hello" /tmp/output.ogg            # Explicit output path

Output files are written to data/outbound/ by default and auto-cleaned after 1 hour.

`lettabot-message --voice`

Send a voice note from a background task (heartbeat, cron):

# Generate + send in one step
OUTPUT=$(lettabot-tts "Your message here") || exit 1
lettabot-message send --file "$OUTPUT" --voice

# Send to a specific channel
lettabot-message send --file "$OUTPUT" --voice --channel telegram --chat 123456

Environment Variable Reference

Variable	Description	Default
Transcription
`OPENAI_API_KEY`	OpenAI API key (Whisper transcription + OpenAI TTS)	--
`MISTRAL_API_KEY`	Mistral API key (Voxtral transcription)	--
`TRANSCRIPTION_MODEL`	Override transcription model	`whisper-1` / `voxtral-mini-latest`
Text-to-Speech
`TTS_PROVIDER`	TTS backend	`elevenlabs`
`ELEVENLABS_API_KEY`	ElevenLabs API key	--
`ELEVENLABS_VOICE_ID`	ElevenLabs voice ID	`onwK4e9ZLuTAKqWW03F9`
`ELEVENLABS_MODEL_ID`	ElevenLabs model	`eleven_multilingual_v2`
`OPENAI_TTS_VOICE`	OpenAI TTS voice name	`alloy`
`OPENAI_TTS_MODEL`	OpenAI TTS model	`tts-1`

All environment variables can be overridden by the equivalent YAML config fields (see above).

Troubleshooting

Voice messages not transcribing

Check that an API key is configured -- either in lettabot.yaml under transcription.apiKey or via the OPENAI_API_KEY / MISTRAL_API_KEY environment variable
Check the logs for transcription errors
If using an unsupported audio format, install ffmpeg for automatic conversion

Voice memos not generating

Check that a TTS provider is configured -- either in lettabot.yaml under tts or via ELEVENLABS_API_KEY / OPENAI_API_KEY
Check that jq and curl are installed (required by the lettabot-tts script)
Check logs for voice pipeline events:
- [Bot] Directive voice: generating memo (...)
- [Bot] Directive voice: generated file ...
- [Bot] Directive voice failed: ...
- [Telegram] sendVoice failed, falling back to sendAudio: ...
Check logs for TTS API errors (HTTP status codes, rate limits)

Docker checklist for voice

For container images, ensure these binaries are available:

bash (required by lettabot-tts shebang)
curl and jq (required for TTS API calls)
ffmpeg (recommended for full inbound voice transcription compatibility)
ca-certificates (required for HTTPS API calls)

Quick runtime validation from inside the container:

which bash curl jq ffmpeg
lettabot-tts "TTS health check"

Telegram voice privacy

If the bot sends audio files instead of voice bubbles on Telegram, the recipient has voice message privacy enabled (Telegram Premium feature). They can allow voice messages via Settings > Privacy and Security > Voice Messages.

7.3 KiB Raw Blame History

Voice

Voice Transcription (Receiving Voice Messages)

Providers

Supported Audio Formats

Channel Support

Voice Memos (Sending Voice Notes)

How It Works

Providers

Channel Support

When to Use Voice

When NOT to Use Voice

CLI Tools

lettabot-tts

lettabot-message --voice

Environment Variable Reference

Troubleshooting

Voice messages not transcribing

Voice memos not generating

Docker checklist for voice

Telegram voice privacy

7.3 KiB

Raw Blame History

`lettabot-tts`

`lettabot-message --voice`