# Realtime Text to Speech

Build streaming text-to-speech apps with realtime AI models.

## Overview

Realtime text-to-speech models convert text into streaming audio. Unlike standard TTS, which processes a full prompt and returns an audio file, realtime TTS streams AI-generated speech as continuous PCM audio over a WebSocket connection while it is being generated. The text prompt is submitted via [POST /Run](/docs/run-a-model), not over the WebSocket. The WebSocket carries only task events (`task_info`, `task_stream_ready`, etc.), binary audio frames, and control messages (`task_session_end`).

No microphone is required. The flow is one-directional: text goes in, audio comes out.

A session follows these steps:

1. **Run** the realtime TTS model via [POST /Run](/docs/run-a-model) with your text prompt in the parameters
2. **Connect** to the WebSocket and send `task_info` with your `socketaccesstoken` as the `tasktoken` value
3. **Wait** for `task_stream_ready` — the model has loaded and is generating audio
4. **Receive** AI audio as binary frames and play them
5. **End** the session with `task_session_end` or wait for the stream to finish naturally

## How It Differs from Realtime Voice Conversation

| | Realtime Voice Conversation | Realtime Text to Speech |
|---|---|---|
| Input | Microphone audio (streamed) | Text (sent with the run request) |
| Output | AI audio + transcripts | AI audio only |
| Direction | Bidirectional (client ↔ server) | Server → client only |
| Microphone | Required | Not required |
| Transcripts | `TRANSCRIPT_USER:` / `TRANSCRIPT_AI:` via `task_output` | None |
| Use case | Interactive voice chat | Narration, voiceover, assistants |

## Connection & Registration

After running the task, connect to the WebSocket and register with `task_info`:

```javascript
var ws = new WebSocket("wss://socket.wiro.ai/v1");

ws.onopen = function() {
  ws.send(JSON.stringify({
    type: "task_info",
    tasktoken: "YOUR_SOCKET_ACCESS_TOKEN"
  }));
};
```

> **Note:** Both standard and realtime models use `type: "task_info"` with `tasktoken` to register on the WebSocket. The registration flow is identical to [Realtime Voice Conversation](/docs/realtime-voice-conversation).

## Realtime Events

During a realtime TTS session, you'll receive these WebSocket events:

| Event | Description |
|-------|-------------|
| `task_stream_ready` | Session is ready — the model is generating audio and will begin sending chunks |
| `task_stream_end` | The model finished generating audio for the current segment |
| `task_cost` | Cost update — includes `turnCost`, `cumulativeCost`, and `usage` (raw cost breakdown from the model provider) |
| `task_end` | The model process has exited. Post-processing follows; wait for `task_postprocess_end` before closing the connection. |
| `task_postprocess_end` | Post-processing is complete. Safe to close the WebSocket connection. |

> **No `task_output` events.** Unlike voice conversation, TTS sessions do not produce transcript events. The input text is already known (you provided it), and the AI output is audio, not text.

### Event Sequence

A typical TTS session produces events in this order:

```
task_stream_ready     ← model is ready, audio chunks start arriving
[binary frames]       ← PCM audio data (many frames)
task_stream_end       ← audio generation complete for this segment
task_cost             ← cost for this segment
task_end              ← model process exiting
task_postprocess_end  ← safe to close WebSocket
```
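
The ordering above can be tracked with a small state machine, which makes it easy to tell when closing the socket is safe. This is a minimal sketch that assumes only the event names listed in this section; the `createSessionTracker` helper and its phase names are hypothetical and not part of the API.

```javascript
// Hypothetical helper: tracks where a TTS session is in its lifecycle.
// Phase names are illustrative; only the event names come from this page.
function createSessionTracker() {
  var phase = 'waiting'; // nothing received yet
  var lastCost = null;   // most recent cumulativeCost, if any

  return {
    // Feed every parsed JSON event here; returns the current phase.
    handle: function (msg) {
      switch (msg.type) {
        case 'task_stream_ready':    phase = 'streaming'; break;
        case 'task_stream_end':      phase = 'stream_ended'; break;
        case 'task_cost':            lastCost = msg.cumulativeCost; break;
        case 'task_end':             phase = 'postprocessing'; break;
        case 'task_postprocess_end': phase = 'done'; break;
      }
      return phase;
    },
    // True only once task_postprocess_end has arrived.
    canClose: function () { return phase === 'done'; },
    lastCost: function () { return lastCost; }
  };
}
```

In a message handler, you would call `tracker.handle(msg)` for each JSON event and close the WebSocket only when `tracker.canClose()` returns `true`.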

## Audio Format

Audio flows in one direction only: **server → client**. The client does not send any audio.

| Property | Value |
|----------|-------|
| Format | PCM (raw, uncompressed) |
| Bit depth | 16-bit signed integer (Int16) |
| Sample rate | 24,000 Hz (24 kHz) |
| Channels | Mono (1 channel) |
| Byte order | Little-endian |
| Chunk size | Variable (typically 200 ms = 4,800 samples = 9,600 bytes) |
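
The chunk-size figures in the table follow directly from the sample rate and bit depth. The sketch below shows the arithmetic; the function names are illustrative, not part of any SDK.

```javascript
// Format constants taken from the table above.
var SAMPLE_RATE = 24000;  // samples per second
var BYTES_PER_SAMPLE = 2; // 16-bit PCM
var CHANNELS = 1;         // mono

// Bytes needed for a chunk of the given duration.
function chunkBytes(durationMs) {
  var samples = SAMPLE_RATE * durationMs / 1000;
  return samples * BYTES_PER_SAMPLE * CHANNELS;
}

// Playback duration (ms) of a chunk with the given byte length.
function chunkDurationMs(byteLength) {
  return byteLength * 1000 / (BYTES_PER_SAMPLE * CHANNELS * SAMPLE_RATE);
}

// chunkBytes(200)       -> 9600  (4,800 samples x 2 bytes)
// chunkDurationMs(9600) -> 200
```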

### Binary Frame Format

Every binary WebSocket frame from the server is structured as:

```
[tasktoken]|[PCM audio data]
```

The pipe character `|` (0x7C) separates the token from the raw audio bytes. To extract the audio:

1. Find the first `|` byte in the binary frame
2. Everything after it is raw PCM Int16 audio data
3. Convert Int16 samples to your playback format (e.g., Float32 for Web Audio API)
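
Steps 1 and 2 can be sketched as a small helper. It splits on the first pipe byte only, since the PCM payload itself may contain `0x7C` bytes. The `splitFrame` name is hypothetical, and decoding the token as UTF-8 text is an assumption about how the token is encoded.

```javascript
// Splits a binary frame of the form tasktoken|pcm into its two parts.
// `frame` is a Uint8Array (a Node.js Buffer also works, being a subclass).
// Returns null when no pipe separator is present.
function splitFrame(frame) {
  var pipeIndex = frame.indexOf(0x7C); // first '|' only
  if (pipeIndex < 0) return null;

  // Assumption: the token is plain text that decodes as UTF-8.
  var token = new TextDecoder().decode(frame.subarray(0, pipeIndex));

  // subarray() creates a view on the same memory, so no copy is made.
  var pcm = frame.subarray(pipeIndex + 1);
  return { token: token, pcm: pcm };
}
```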

> **Client → server:** In TTS mode, you do not send binary audio frames. The only messages you send are `task_info` (to register) and `task_session_end` (to end the session).

## Receiving AI Audio

AI speech arrives as binary WebSocket frames in PCM Int16 24 kHz format. To play them:

1. Check if the incoming message is binary (a `Blob` in browser JavaScript, a `Buffer` in Node.js with `ws`, `bytes` in Python) before attempting JSON parse
2. Find the pipe `|` separator and extract audio data after it
3. Convert Int16 → Float32 and create an `AudioBuffer`
4. Schedule gapless playback using `AudioBufferSourceNode` to avoid clicks between chunks
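
The Int16 → Float32 conversion in step 3 can be sketched with a `DataView`, which reads each sample as little-endian explicitly and does not require the PCM bytes to start at an even offset. The `pcm16ToFloat32` name is illustrative; the input is assumed to be the audio bytes after the pipe separator, as a `Uint8Array`.

```javascript
// Converts little-endian PCM Int16 bytes to Float32 samples in [-1, 1).
// `pcm` is a Uint8Array holding only the audio bytes (token already removed).
function pcm16ToFloat32(pcm) {
  // Ignore a trailing odd byte, which cannot form a full 16-bit sample.
  var sampleCount = Math.floor(pcm.byteLength / 2);
  var view = new DataView(pcm.buffer, pcm.byteOffset, sampleCount * 2);
  var out = new Float32Array(sampleCount);

  for (var i = 0; i < sampleCount; i++) {
    // `true` selects little-endian byte order, matching the stream format.
    out[i] = view.getInt16(i * 2, true) / 32768;
  }
  return out;
}
```

The resulting `Float32Array` can be copied into an `AudioBuffer` channel with `getChannelData(0).set(...)`.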

### Gapless Playback

Audio arrives in many small chunks. To play them seamlessly:

- Track a `nextPlayTime` variable initialized to `0`
- For each chunk, schedule it at `max(audioContext.currentTime, nextPlayTime)`
- Advance `nextPlayTime` by the chunk's duration
- This ensures chunks play back-to-back with no gaps or overlaps
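
The scheduling rule above can be isolated from the Web Audio API as a small pure helper, which makes it easy to reason about. This is a sketch; `createScheduler` is a hypothetical name, and `currentTime` stands in for `audioContext.currentTime`.

```javascript
// Computes start times so consecutive chunks play back-to-back.
function createScheduler() {
  var nextPlayTime = 0;

  return {
    // currentTime: audio clock now, in seconds
    // duration:    length of the chunk, in seconds
    // Returns the time at which the chunk should start.
    schedule: function (currentTime, duration) {
      var startAt = Math.max(currentTime, nextPlayTime);
      nextPlayTime = startAt + duration;
      return startAt;
    },
    // Call when a new stream begins.
    reset: function () { nextPlayTime = 0; }
  };
}

// var s = createScheduler();
// s.schedule(1.0, 0.25); // -> 1.0  (first chunk starts immediately)
// s.schedule(1.1, 0.25); // -> 1.25 (queued right after the first chunk)
// s.schedule(2.0, 0.25); // -> 2.0  (queue ran dry, so start now)
```

Taking the maximum handles both cases: when chunks arrive faster than realtime they queue up behind each other, and when the queue runs dry the next chunk starts immediately instead of being scheduled in the past.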

## Ending a Session

To gracefully end a realtime TTS session, send `task_session_end`:

```json
{
  "type": "task_session_end",
  "tasktoken": "YOUR_SOCKET_ACCESS_TOKEN"
}
```

After sending this, the server will finish any in-progress generation, send final cost events, and then emit `task_postprocess_end`. Wait for `task_postprocess_end` before closing the WebSocket.

For TTS sessions, the stream often ends naturally when the model finishes generating audio for the provided text. In this case, you'll receive `task_stream_end`, then `task_cost`, `task_end`, and `task_postprocess_end`, without needing to send `task_session_end`. However, it's good practice to send it explicitly for a clean shutdown, especially if you want to stop playback early.

> **Safety:** If the client disconnects without sending `task_session_end`, the server automatically terminates the session to prevent the pipeline from running indefinitely (and the provider from continuing to charge). Always send `task_session_end` explicitly for a clean shutdown.

> **Insufficient balance:** If the wallet runs out of balance during a realtime session, the server automatically stops the session. You will still receive the final `task_cost` and `task_end` events.
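
The shutdown handshake described in this section can be wrapped in a promise. This is a minimal sketch: `endSessionAndWait` is a hypothetical helper, `ws` is assumed to be any WebSocket-like object with `send()`, `addEventListener()`, and `removeEventListener()`, and `timeoutMs` is a client-side guard of your choosing rather than a documented limit.

```javascript
// Sends task_session_end and resolves once task_postprocess_end arrives.
// Rejects if nothing arrives within timeoutMs (a client-side choice).
function endSessionAndWait(ws, token, timeoutMs) {
  return new Promise(function (resolve, reject) {
    var timer = setTimeout(function () {
      ws.removeEventListener('message', onMessage);
      reject(new Error('Timed out waiting for task_postprocess_end'));
    }, timeoutMs);

    function onMessage(event) {
      // Binary audio frames may still arrive; only JSON text is inspected.
      if (typeof event.data !== 'string') return;
      var msg = JSON.parse(event.data);
      if (msg.type === 'task_postprocess_end') {
        clearTimeout(timer);
        ws.removeEventListener('message', onMessage);
        resolve();
      }
    }

    // Listen first, then send, so the reply cannot be missed.
    ws.addEventListener('message', onMessage);
    ws.send(JSON.stringify({ type: 'task_session_end', tasktoken: token }));
  });
}
```

A caller might use `endSessionAndWait(ws, socketToken, 10000).then(function () { ws.close(); })`, closing the socket only after post-processing is confirmed.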

## Code Examples

### JavaScript

```javascript
// Realtime TTS Session — Connect, Receive Audio, and Play

var socketToken = 'YOUR_SOCKET_ACCESS_TOKEN';
var ws = new WebSocket('wss://socket.wiro.ai/v1');

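// Audio playback state
// Note: browsers may start an AudioContext suspended until a user gesture;
// if so, call playCtx.resume() from a click or tap handler.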
var playCtx = new AudioContext({ sampleRate: 24000 });
var nextPlayTime = 0;

ws.onopen = function() {
  ws.send(JSON.stringify({
    type: 'task_info',
    tasktoken: socketToken
  }));
};

ws.onmessage = function(event) {
  if (event.data instanceof Blob) {
    playAudioChunk(event.data); // defined in "Play Audio" section below
    return;
  }

  var msg = JSON.parse(event.data);

  if (msg.type === 'task_stream_ready') {
    console.log('TTS stream ready — audio chunks incoming');
    nextPlayTime = 0;
  }

  if (msg.type === 'task_stream_end') {
    console.log('Audio generation complete');
  }

  if (msg.type === 'task_cost') {
    console.log('Turn cost:', msg.turnCost,
      'Total:', msg.cumulativeCost);
  }

  if (msg.type === 'task_end') {
    console.log('Session ended');
  }

  if (msg.type === 'task_postprocess_end') {
    console.log('Post-processing done — closing');
    ws.close();
  }
};

function endSession() {
  ws.send(JSON.stringify({
    type: 'task_session_end',
    tasktoken: socketToken
  }));
}
```

### Play Audio

```javascript
// Receive and play AI audio (PCM Int16 24kHz)
// Gapless scheduling ensures smooth, uninterrupted playback

function playAudioChunk(blob) {
  blob.arrayBuffer().then(function(buffer) {
    var bytes = new Uint8Array(buffer);

    // Find pipe separator between token and audio
    var pipeIndex = bytes.indexOf(0x7C);
    if (pipeIndex < 0) return;

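    // Copy the PCM bytes, dropping a trailing odd byte so that
    // Int16Array (which needs an even byte length) does not throw
    var pcmLength = (buffer.byteLength - pipeIndex - 1) & ~1;
    if (pcmLength === 0) return;
    var audioData = buffer.slice(pipeIndex + 1, pipeIndex + 1 + pcmLength);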

    // Convert Int16 → Float32 for Web Audio API
    var int16 = new Int16Array(audioData);
    var float32 = new Float32Array(int16.length);
    for (var i = 0; i < int16.length; i++) {
      float32[i] = int16[i] / 32768.0;
    }

    // Create AudioBuffer and schedule gapless playback
    var audioBuf = playCtx.createBuffer(
      1, float32.length, 24000
    );
    audioBuf.getChannelData(0).set(float32);

    var src = playCtx.createBufferSource();
    src.buffer = audioBuf;
    src.connect(playCtx.destination);

    var now = playCtx.currentTime;
    var startAt = Math.max(now, nextPlayTime);
    src.start(startAt);
    nextPlayTime = startAt + audioBuf.duration;
  });
}
```

### Python

```python
import asyncio
import json
import websockets
import pyaudio

SOCKET_TOKEN = 'YOUR_SOCKET_ACCESS_TOKEN'
SAMPLE_RATE = 24000

async def tts_session():
    uri = 'wss://socket.wiro.ai/v1'
    async with websockets.connect(uri) as ws:
        # Register
        await ws.send(json.dumps({
            'type': 'task_info',
            'tasktoken': SOCKET_TOKEN
        }))

        # Audio output setup (speaker only — no mic needed)
        pa = pyaudio.PyAudio()
        speaker = pa.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=SAMPLE_RATE,
            output=True
        )

        session_active = True

        async def receive():
            nonlocal session_active
            async for msg in ws:
                if isinstance(msg, bytes):
                    # Binary frame: tasktoken|pcm_data
                    pipe = msg.find(0x7C)
                    if pipe == -1:
                        continue
                    audio = msg[pipe + 1:]
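                    # speaker.write() blocks until the audio is queued, so run
                    # it in a worker thread (asyncio.to_thread needs Python 3.9+)
                    await asyncio.to_thread(speaker.write, audio)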
                    continue

                data = json.loads(msg)
                t = data['type']

                if t == 'task_stream_ready':
                    print('TTS stream ready', flush=True)

                elif t == 'task_stream_end':
                    print('Audio generation complete', flush=True)

                elif t == 'task_cost':
                    print(f'Cost: {data["cumulativeCost"]}', flush=True)

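                elif t == 'task_end':
                    # Model process exited; post-processing still follows
                    print('Session ended', flush=True)

                elif t == 'task_postprocess_end':
                    # Safe to stop reading and let the connection close
                    print('Post-processing done', flush=True)
                    session_active = False
                    break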

        try:
            await receive()
        finally:
            speaker.stop_stream()
            speaker.close()
            pa.terminate()

asyncio.run(tts_session())
```

### Node.js

```javascript
const WebSocket = require('ws');

const SOCKET_TOKEN = 'YOUR_SOCKET_ACCESS_TOKEN';
const ws = new WebSocket('wss://socket.wiro.ai/v1');

let audioChunks = [];

ws.on('open', () => {
  ws.send(JSON.stringify({
    type: 'task_info',
    tasktoken: SOCKET_TOKEN
  }));
});

ws.on('message', (data, isBinary) => {
  if (isBinary) {
    const buf = Buffer.from(data);
    const pipe = buf.indexOf(0x7C);
    if (pipe !== -1) {
      // subarray() returns a view without copying; Buffer#slice is deprecated
      const audio = buf.subarray(pipe + 1);
      audioChunks.push(audio);
      console.log('Audio chunk:', audio.length, 'bytes');
    }
    return;
  }

  const msg = JSON.parse(data.toString());

  if (msg.type === 'task_stream_ready') {
    console.log('TTS stream ready — receiving audio');
    audioChunks = [];
  }

  if (msg.type === 'task_stream_end') {
    console.log('Audio complete:',
      audioChunks.length, 'chunks received');
    // Concatenate and save or pipe to speaker
    const total = Buffer.concat(audioChunks);
    console.log('Total audio:', total.length, 'bytes',
      '(' + (total.length / 2 / 24000).toFixed(1) + 's)');
  }

  if (msg.type === 'task_cost') {
    console.log('Cost:', msg.cumulativeCost);
  }

  if (msg.type === 'task_end') {
    console.log('Done');
  }

  if (msg.type === 'task_postprocess_end') {
    ws.close();
  }
});

// End session early (optional — stream ends naturally)
function endSession() {
  ws.send(JSON.stringify({
    type: 'task_session_end',
    tasktoken: SOCKET_TOKEN
  }));
}
```

### Format

```json
{
  "audio_format": {
    "codec": "PCM",
    "bit_depth": "16-bit Int16",
    "sample_rate": 24000,
    "channels": 1,
    "byte_order": "little-endian"
  },
  "binary_frame": {
    "direction": "server → client",
    "format": "tasktoken|pcm_audio_data",
    "separator": "|",
    "separator_byte": "0x7C"
  },
  "typical_chunk": {
    "samples": 4800,
    "duration_ms": 200,
    "bytes": 9600
  },
  "events": {
    "task_stream_ready": "Audio generation started",
    "task_stream_end": "Audio generation complete",
    "task_cost": "Cost per turn + cumulative",
    "task_session_end": "Send to end session early",
    "task_end": "Server ended session",
    "task_postprocess_end": "Post-processing complete — safe to close"
  }
}
```