Description
Python -VV

Python 3.12.4 (main, Jun 6 2024, 18:26:44) [Clang 15.0.0 (clang-1500.3.9.4)]

Pip Freeze

mistralai==1.12.2
websockets==15.0.1

Reproduction Steps
- Open a realtime transcription WebSocket to voxtral-mini-transcribe-realtime-2602.
- Stream PCM audio of a user saying "Who built the White Tower" via input_audio.append.
- Audio finishes. Do not send input_audio.end. Observe transcription.text.delta events.
- Model emits deltas for Who, built, the, White, then stops. Tower never arrives.
- 30 seconds later, the server returns error 3804: "Timeout waiting for response from streaming transcription."
- Sending input_audio.end does flush Tower and trigger transcription.done, but the connection becomes terminal; no more audio is accepted.
Reproducible on every utterance. The model holds the last 1-2 tokens until it receives input_audio.end.
import asyncio

from mistralai import Mistral
from mistralai.models import AudioFormat


async def main():
    client = Mistral(api_key="YOUR_KEY")
    rt = client.audio.realtime.transcription

    async with await rt.connect(
        model="voxtral-mini-transcribe-realtime-2602",
        audio_format=AudioFormat(encoding="pcm_s16le", sample_rate=16000),
    ) as conn:
        # audio_chunks: raw pcm_s16le chunks (16 kHz) of the utterance
        for chunk in audio_chunks:
            await conn.send_audio(chunk)

        # conn.end_audio() flushes buffered tokens, but kills the connection.
        # No alternative exists to flush without ending.

        async for event in conn.events():
            print(event)  # last word never arrives without end_audio()


asyncio.run(main())

Expected Behavior
The SDK needs a non-terminal flush: "emit all buffered tokens and transcription.done for the current utterance, then accept new audio on the same WebSocket."
Additional Context
The native C implementation (antirez/voxtral.c) already distinguishes these two operations:
- vox_stream_flush(): processes buffered audio and emits pending tokens. Keeps the stream open. Used on silence detection between utterances.
- vox_stream_finish(): terminal. No more audio accepted.
The cloud API's input_audio.end maps to finish. There is no flush equivalent.
Without a non-terminal flush, any application doing multi-turn transcription (voice agents, live captioning with speaker pauses, etc.) must create a new WebSocket per utterance, paying ~200-500ms connection setup overhead each time. The model also buffers the last word of every utterance until end_audio(), forcing idle-timeout heuristics that add latency and sometimes truncate transcripts.
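For reference, that workaround looks roughly like the sketch below. It reuses rt and AudioFormat from the reproduction script; the utterance iteration and the transcription.done check are illustrative assumptions, not documented SDK behavior.

# Workaround sketch: a fresh WebSocket per utterance, paying connection
# setup (~200-500ms) every time. Illustrative only.
async def transcribe_per_utterance(rt, utterances):
    for utterance_chunks in utterances:
        async with await rt.connect(
            model="voxtral-mini-transcribe-realtime-2602",
            audio_format=AudioFormat(encoding="pcm_s16le", sample_rate=16000),
        ) as conn:
            for chunk in utterance_chunks:
                await conn.send_audio(chunk)
            await conn.end_audio()  # terminal: flushes the last word, no more audio
            async for event in conn.events():
                print(event)
                if getattr(event, "type", None) == "transcription.done":
                    break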
Suggested Solutions
Add input_audio.flush as a client-to-server message:
{"type": "input_audio.flush"}On receiving this, the model processes all buffered audio, emits remaining transcription.text.delta events and a transcription.done with the complete utterance text. The audio stream stays open for subsequent input_audio.append messages. This mirrors vox_stream_flush() in voxtral.c.
SDK addition:
async def flush_audio(self) -> None:
    """Flush buffered audio without ending the stream."""
    if self._closed:
        raise RuntimeError("Connection is closed")
    await self._websocket.send(json.dumps({"type": "input_audio.flush"}))
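With that method, a multi-turn session could stay on a single connection. A usage sketch, assuming the flush_audio() proposed above; the per-utterance loop and the transcription.done check are illustrative:

# Sketch: multiple utterances on one WebSocket, assuming flush_audio() exists.
async def transcribe_session(rt, utterances):
    async with await rt.connect(
        model="voxtral-mini-transcribe-realtime-2602",
        audio_format=AudioFormat(encoding="pcm_s16le", sample_rate=16000),
    ) as conn:
        for utterance_chunks in utterances:   # e.g. segmented on silence detection
            for chunk in utterance_chunks:
                await conn.send_audio(chunk)
            await conn.flush_audio()          # proposed non-terminal flush
            async for event in conn.events():
                print(event)
                if getattr(event, "type", None) == "transcription.done":
                    break                     # utterance complete; keep streaming

This keeps the flush/finish split from voxtral.c: end_audio() remains the terminal operation, while flush_audio() marks an utterance boundary on a live connection.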