
[BUG CLIENT]: Realtime transcription WebSocket lacks non-terminal flush — model buffers last word(s) until input_audio.end #357

@zkewal

Description

Python -VV

Python 3.12.4 (main, Jun  6 2024, 18:26:44) [Clang 15.0.0 (clang-1500.3.9.4)]

Pip Freeze

mistralai==1.12.2
websockets==15.0.1

Reproduction Steps

  1. Open a realtime transcription WebSocket to voxtral-mini-transcribe-realtime-2602.
  2. Stream PCM audio of a user saying "Who built the White Tower" via input_audio.append.
  3. Audio finishes. Do not send input_audio.end. Observe transcription.text.delta events.
  4. Model emits deltas for Who, built, the, White — then stops. Tower never arrives.
  5. 30 seconds later, server returns error 3804: "Timeout waiting for response from streaming transcription."
  6. Sending input_audio.end does flush Tower and triggers transcription.done, but the connection becomes terminal — no more audio accepted.

Reproducible on every utterance. The model holds the last 1-2 tokens until it receives input_audio.end.

import asyncio
from mistralai import Mistral
from mistralai.models import AudioFormat

async def main():
    client = Mistral(api_key="YOUR_KEY")
    rt = client.audio.realtime.transcription
    async with await rt.connect(
        model="voxtral-mini-transcribe-realtime-2602",
        audio_format=AudioFormat(encoding="pcm_s16le", sample_rate=16000),
    ) as conn:
        # audio_chunks: iterable of raw pcm_s16le byte frames (defined elsewhere)
        for chunk in audio_chunks:
            await conn.send_audio(chunk)

        # conn.end_audio() flushes buffered tokens, but kills the connection.
        # No alternative exists to flush without ending.

        async for event in conn.events():
            print(event)  # last word never arrives without end_audio()

asyncio.run(main())

Expected Behavior

The SDK needs a non-terminal flush: "emit all buffered tokens and transcription.done for the current utterance, then accept new audio on the same WebSocket."

Additional Context

The native C implementation (antirez/voxtral.c) already distinguishes these two operations:

  • vox_stream_flush() — processes buffered audio, emits pending tokens. Keeps the stream open. Used on silence detection between utterances.
  • vox_stream_finish() — terminal. No more audio accepted.

The cloud API's input_audio.end maps to finish. There is no flush equivalent.

Without a non-terminal flush, any application doing multi-turn transcription (voice agents, live captioning with speaker pauses, etc.) must create a new WebSocket per utterance, paying ~200-500ms connection setup overhead each time. The model also buffers the last word of every utterance until end_audio(), forcing idle-timeout heuristics that add latency and sometimes truncate transcripts.
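To make the cost concrete, here is a minimal sketch of the workaround as it stands today: one connection per utterance, torn down with the terminal `end_audio()` just to get the last word out. The `connect` factory and the event field names (`"type"`, `"text"`) are assumptions for illustration; `send_audio`, `end_audio`, and `events` follow the repro snippet above.

```python
import asyncio

async def transcribe_utterance(connect, chunks):
    """Transcribe ONE utterance on a fresh connection.

    `connect` is a hypothetical factory returning an async context
    manager with the connection shape used in the repro snippet.
    Paying connection setup per utterance is exactly the overhead
    this issue is about.
    """
    parts = []
    async with connect() as conn:
        for chunk in chunks:
            await conn.send_audio(chunk)
        # Terminal flush: forces the buffered last word out, but the
        # connection cannot accept further audio afterwards.
        await conn.end_audio()
        async for event in conn.events():
            if event["type"] == "transcription.text.delta":
                parts.append(event["text"])
            elif event["type"] == "transcription.done":
                break
    return "".join(parts)
```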

Suggested Solutions

Add input_audio.flush as a client-to-server message:

{"type": "input_audio.flush"}

On receiving this, the model processes all buffered audio, emits remaining transcription.text.delta events and a transcription.done with the complete utterance text. The audio stream stays open for subsequent input_audio.append messages. This mirrors vox_stream_flush() in voxtral.c.

SDK addition:

async def flush_audio(self) -> None:
    """Flush buffered audio without ending the stream."""
    if self._closed:
        raise RuntimeError("Connection is closed")
    await self._websocket.send(json.dumps({"type": "input_audio.flush"}))
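If that method existed, multi-turn transcription on a single socket would look roughly like this. This is a sketch under the same assumptions as above: the `transcription.text.delta` / `transcription.done` event types come from the issue, while `flush_audio` and the event field names are hypothetical.

```python
import asyncio

async def transcribe_turns(conn, utterances):
    """Stream several utterances over ONE connection, collecting one
    transcript per utterance, using the proposed flush_audio().

    `conn` mirrors the SDK connection plus the hypothetical
    flush_audio method suggested above.
    """
    results = []
    for chunks in utterances:
        for chunk in chunks:
            await conn.send_audio(chunk)
        # Proposed non-terminal flush: emit buffered tokens and a
        # transcription.done, but keep the socket open for more audio.
        await conn.flush_audio()
        parts = []
        async for event in conn.events():
            if event["type"] == "transcription.text.delta":
                parts.append(event["text"])
            elif event["type"] == "transcription.done":
                break  # utterance complete; connection still usable
        results.append("".join(parts))
    # Terminal end only once the whole session is over.
    await conn.end_audio()
    return results
```

`transcription.done` acts as the per-utterance delimiter, so the client never needs idle-timeout heuristics to guess when the last word has arrived.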
