
[BUG CLIENT]: Realtime transcription WebSocket lacks non-terminal flush — model buffers last word(s) until input_audio.end #357

@zkewal

Description

Python -VV

Python 3.12.4 (main, Jun  6 2024, 18:26:44) [Clang 15.0.0 (clang-1500.3.9.4)]

Pip Freeze

mistralai==1.12.2
websockets==15.0.1

Reproduction Steps

  1. Open a realtime transcription WebSocket to voxtral-mini-transcribe-realtime-2602.
  2. Stream PCM audio of a user saying "Who built the White Tower" via input_audio.append.
  3. Audio finishes. Do not send input_audio.end. Observe transcription.text.delta events.
  4. Model emits deltas for Who, built, the, White — then stops. Tower never arrives.
  5. 30 seconds later, server returns error 3804: "Timeout waiting for response from streaming transcription."
  6. Sending input_audio.end does flush Tower and triggers transcription.done, but the connection becomes terminal — no more audio accepted.

Reproducible on every utterance. The model holds the last 1-2 tokens until it receives input_audio.end.

import asyncio
from mistralai import Mistral
from mistralai.models import AudioFormat

async def main():
    client = Mistral(api_key="YOUR_KEY")
    rt = client.audio.realtime.transcription
    async with await rt.connect(
        model="voxtral-mini-transcribe-realtime-2602",
        audio_format=AudioFormat(encoding="pcm_s16le", sample_rate=16000),
    ) as conn:
        # audio_chunks: iterable of raw pcm_s16le byte frames (defined elsewhere)
        for chunk in audio_chunks:
            await conn.send_audio(chunk)

        # conn.end_audio() flushes buffered tokens, but kills the connection.
        # No alternative exists to flush without ending.

        async for event in conn.events():
            print(event)  # last word never arrives without end_audio()

asyncio.run(main())

Expected Behavior

The SDK needs a non-terminal flush: "emit all buffered tokens and transcription.done for the current utterance, then accept new audio on the same WebSocket."

Additional Context

The native C implementation (antirez/voxtral.c) already distinguishes these two operations:

  • vox_stream_flush() — processes buffered audio, emits pending tokens. Keeps the stream open. Used on silence detection between utterances.
  • vox_stream_finish() — terminal. No more audio accepted.

The cloud API's input_audio.end maps to finish. There is no flush equivalent.

Without a non-terminal flush, any application doing multi-turn transcription (voice agents, live captioning with speaker pauses, etc.) must create a new WebSocket per utterance, paying ~200-500ms connection setup overhead each time. The model also buffers the last word of every utterance until end_audio(), forcing idle-timeout heuristics that add latency and sometimes truncate transcripts.
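To make the cost concrete, here is a minimal sketch of the workaround as it stands today: one connection per utterance, torn down with the terminal `end_audio()` just to get the last word out. The `connect` factory and the event field names (`"type"`, `"text"`) are assumptions for illustration; `send_audio`, `end_audio`, and `events` follow the repro snippet above.

```python
import asyncio

async def transcribe_utterance(connect, chunks):
    """Transcribe ONE utterance on a fresh connection.

    `connect` is a hypothetical factory returning an async context
    manager with the connection shape used in the repro snippet.
    Paying connection setup per utterance is exactly the overhead
    this issue is about.
    """
    parts = []
    async with connect() as conn:
        for chunk in chunks:
            await conn.send_audio(chunk)
        # Terminal flush: forces the buffered last word out, but the
        # connection cannot accept further audio afterwards.
        await conn.end_audio()
        async for event in conn.events():
            if event["type"] == "transcription.text.delta":
                parts.append(event["text"])
            elif event["type"] == "transcription.done":
                break
    return "".join(parts)
```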

Suggested Solutions

Add input_audio.flush as a client-to-server message:

{"type": "input_audio.flush"}

On receiving this, the model processes all buffered audio, emits remaining transcription.text.delta events and a transcription.done with the complete utterance text. The audio stream stays open for subsequent input_audio.append messages. This mirrors vox_stream_flush() in voxtral.c.

SDK addition:

async def flush_audio(self) -> None:
    """Flush buffered audio without ending the stream."""
    if self._closed:
        raise RuntimeError("Connection is closed")
    await self._websocket.send(json.dumps({"type": "input_audio.flush"}))
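If that method existed, multi-turn transcription on a single socket would look roughly like this. This is a sketch under the same assumptions as above: the `transcription.text.delta` / `transcription.done` event types come from the issue, while `flush_audio` and the event field names are hypothetical.

```python
import asyncio

async def transcribe_turns(conn, utterances):
    """Stream several utterances over ONE connection, collecting one
    transcript per utterance, using the proposed flush_audio().

    `conn` mirrors the SDK connection plus the hypothetical
    flush_audio method suggested above.
    """
    results = []
    for chunks in utterances:
        for chunk in chunks:
            await conn.send_audio(chunk)
        # Proposed non-terminal flush: emit buffered tokens and a
        # transcription.done, but keep the socket open for more audio.
        await conn.flush_audio()
        parts = []
        async for event in conn.events():
            if event["type"] == "transcription.text.delta":
                parts.append(event["text"])
            elif event["type"] == "transcription.done":
                break  # utterance complete; connection still usable
        results.append("".join(parts))
    # Terminal end only once the whole session is over.
    await conn.end_audio()
    return results
```

`transcription.done` acts as the per-utterance delimiter, so the client never needs idle-timeout heuristics to guess when the last word has arrived.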
