# Realtime Voice

Build interactive voice conversation apps with realtime AI models.

## Overview

Realtime voice models enable two-way audio conversations with AI. Unlike standard model runs that process a single input and return a result, realtime sessions maintain a persistent WebSocket connection where you stream microphone audio and receive AI speech in real time.

The flow is:

1. **Run** the realtime model via [POST /Run](/docs/run-a-model) to get a `socketaccesstoken`
2. **Connect** to the WebSocket and send `task_info` with your token
3. **Wait** for `task_stream_ready` — the model is ready to receive audio
4. **Stream** microphone audio as binary frames
5. **Receive** AI audio as binary frames and play them
6. **End** the session with `task_session_end`

## Connection & Registration

After running the task, connect to the WebSocket and register with `task_info`:

```javascript
var ws = new WebSocket("wss://socket.wiro.ai/v1");

ws.onopen = function() {
  ws.send(JSON.stringify({
    type: "task_info",
    tasktoken: "YOUR_SOCKET_ACCESS_TOKEN"
  }));
};
```

> **Note:** Both standard and realtime models use `type: "task_info"` with `tasktoken` to register on the WebSocket.

## Realtime Events

During a realtime session, you'll receive these WebSocket events:

| Event | Description |
|-------|-------------|
| `task_stream_ready` | Session is ready — start sending microphone audio |
| `task_stream_end` | AI finished speaking for this turn — you can speak again |
| `task_cost` | Cost update per turn — includes `turnCost`, `cumulativeCost`, and `usage` (raw cost breakdown from the model provider) |
| `task_output` | Transcript messages prefixed with `TRANSCRIPT_USER:` or `TRANSCRIPT_AI:` |
| `task_end` | Session fully ended — close the connection |

## Audio Format

Both directions (microphone → server, server → client) use the same format:

| Property | Value |
|----------|-------|
| Format | PCM (raw, uncompressed) |
| Bit depth | 16-bit signed integer (Int16) |
| Sample rate | 24,000 Hz (24 kHz) |
| Channels | Mono (1 channel) |
| Byte order | Little-endian |
| Chunk size | 4,800 samples (200 ms) = 9,600 bytes |
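
The chunk size row follows from the other rows: samples = sample rate × duration, and bytes = samples × 2 for 16-bit mono audio. A minimal sketch of that arithmetic is below. `chunkSize` is a hypothetical helper name used only for illustration, not part of the API.

```javascript
// Derive chunk dimensions for raw PCM audio.
// chunkSize is a hypothetical helper, not part of the Wiro API.
function chunkSize(sampleRate, durationMs, bytesPerSample, channels) {
  var samples = Math.round(sampleRate * durationMs / 1000);
  return {
    samples: samples,
    bytes: samples * bytesPerSample * channels
  };
}

var chunk = chunkSize(24000, 200, 2, 1);
console.log(chunk.samples, chunk.bytes); // 4800 9600
```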

### Binary Frame Format

Every binary WebSocket frame (in both directions) is structured as:

```
[tasktoken]|[PCM audio data]
```

The pipe character `|` (0x7C) separates the token from the raw audio bytes.
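
As a sketch of this layout, the helpers below build and parse a frame. `buildFrame` and `parseFrame` are hypothetical names, and the sketch assumes the token itself never contains a `|`. The parser splits on the first pipe only, because raw PCM bytes can legitimately contain 0x7C.

```javascript
// Build and parse "tasktoken|pcm" binary frames.
// buildFrame and parseFrame are hypothetical helper names.
var PIPE = 0x7C; // "|"

function buildFrame(token, pcmBytes) {
  var prefix = new TextEncoder().encode(token + '|');
  var frame = new Uint8Array(prefix.length + pcmBytes.length);
  frame.set(prefix, 0);
  frame.set(pcmBytes, prefix.length);
  return frame;
}

function parseFrame(frame) {
  // Split on the FIRST pipe only: audio bytes may contain 0x7C too.
  var i = frame.indexOf(PIPE);
  if (i < 0) return null;
  return {
    token: new TextDecoder().decode(frame.subarray(0, i)),
    audio: frame.subarray(i + 1)
  };
}

var frame = buildFrame('abc123', new Uint8Array([0x7C, 0x00, 0x10, 0x7C]));
var parsed = parseFrame(frame);
console.log(parsed.token, parsed.audio.length); // abc123 4
```

Note that `subarray` returns a view into the frame, which may start at an odd byte offset. Copy the audio bytes (for example with `slice`) before wrapping them in an `Int16Array`.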

## Sending Microphone Audio

Capture the microphone at 24 kHz using the Web Audio API with an AudioWorklet. Convert the Float32 samples to Int16, prepend your task token and the `|` separator, and send the result as a binary frame.

Key steps:

1. Request microphone access with `getUserMedia` (enable echo cancellation and noise suppression)
2. Create an `AudioContext` at 24,000 Hz sample rate
3. Use an AudioWorklet to buffer and convert samples to Int16
4. Send each chunk as a `tasktoken|pcm_data` binary frame
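
Steps 3 and 4 can be sketched outside the worklet as two small functions: one converts Float32 samples to Int16, the other collects small blocks (an AudioWorklet typically delivers 128 samples per call) into fixed 4,800-sample chunks. `floatToInt16` and `createChunker` are hypothetical names, and this is an illustration under those assumptions rather than the required implementation.

```javascript
// Convert Web Audio Float32 samples (range [-1, 1]) to 16-bit signed PCM.
function floatToInt16(float32) {
  var out = new Int16Array(float32.length);
  for (var i = 0; i < float32.length; i++) {
    var s = Math.max(-1, Math.min(1, float32[i])); // clamp out-of-range input
    out[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  return out;
}

// Accumulate arbitrary-sized blocks and emit fixed-size Int16 chunks.
function createChunker(chunkSamples, onChunk) {
  var pending = new Float32Array(0);
  return function push(block) {
    var merged = new Float32Array(pending.length + block.length);
    merged.set(pending);
    merged.set(block, pending.length);
    var offset = 0;
    while (merged.length - offset >= chunkSamples) {
      onChunk(floatToInt16(merged.subarray(offset, offset + chunkSamples)));
      offset += chunkSamples;
    }
    pending = merged.slice(offset); // keep the remainder for the next call
  };
}

var chunks = [];
var push = createChunker(4800, function (c) { chunks.push(c); });
for (var n = 0; n < 40; n++) push(new Float32Array(128)); // 5,120 samples in total
console.log(chunks.length, chunks[0].length); // 1 4800
```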

## Receiving AI Audio

AI responses arrive as binary WebSocket frames in the same PCM Int16 24 kHz format. To play them:

1. Check whether the message is binary (a `Blob` in browsers by default) before parsing it as JSON
2. Find the pipe `|` separator and extract audio data after it
3. Convert Int16 → Float32 and create an `AudioBuffer`
4. Schedule gapless playback using `AudioBufferSourceNode`
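
Steps 3 and 4 can be sketched as two pure functions that are easy to test without an `AudioContext`: one decodes little-endian Int16 bytes, the other computes the start time for each chunk so that chunks play back to back. `pcm16ToFloat32` and `createScheduler` are hypothetical names. In a browser, the returned start time would be passed to `AudioBufferSourceNode.start()` and `now` would be `AudioContext.currentTime`.

```javascript
// Decode little-endian Int16 PCM bytes into Float32 samples in [-1, 1).
function pcm16ToFloat32(bytes) {
  var count = Math.floor(bytes.length / 2); // ignore a trailing odd byte
  var view = new DataView(bytes.buffer, bytes.byteOffset, count * 2);
  var out = new Float32Array(count);
  for (var i = 0; i < count; i++) {
    out[i] = view.getInt16(i * 2, true) / 32768; // true = little-endian
  }
  return out;
}

// Compute when each chunk should start so playback has no gaps or overlaps.
function createScheduler() {
  var nextStart = 0;
  return function schedule(now, duration) {
    var start = Math.max(now, nextStart); // never schedule in the past
    nextStart = start + duration;
    return start;
  };
}

var schedule = createScheduler();
console.log(schedule(1.0, 0.25)); // 1    (nothing queued, start now)
console.log(schedule(1.1, 0.25)); // 1.25 (queued right after the first chunk)
console.log(schedule(2.0, 0.25)); // 2    (queue ran dry, start now)
```

Using a `DataView` makes the byte order explicit and tolerates views that start at an odd offset, which a plain `Int16Array` over the same buffer would not.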

## Transcripts

Both user and AI speech are transcribed automatically. Transcripts arrive as `task_output` messages with a string prefix:

- `TRANSCRIPT_USER:` — what the user said
- `TRANSCRIPT_AI:` — what the AI said

```json
{
  "type": "task_output",
  "message": "TRANSCRIPT_USER:What's the weather like today?"
}

{
  "type": "task_output",
  "message": "TRANSCRIPT_AI:I'd be happy to help, but I don't have access to real-time weather data."
}
```
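
A small helper can split the prefix from the text so the rest of the client does not depend on hard-coded offsets. `parseTranscript` and the `speaker` labels are hypothetical names chosen for this sketch.

```javascript
// Split a task_output message into speaker and text.
// Returns null when the message is not a transcript.
function parseTranscript(message) {
  var prefixes = { 'TRANSCRIPT_USER:': 'user', 'TRANSCRIPT_AI:': 'ai' };
  if (typeof message !== 'string') return null;
  for (var prefix in prefixes) {
    if (message.startsWith(prefix)) {
      return { speaker: prefixes[prefix], text: message.slice(prefix.length) };
    }
  }
  return null;
}

var t = parseTranscript("TRANSCRIPT_USER:What's the weather like today?");
console.log(t.speaker + ': ' + t.text); // user: What's the weather like today?
```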

## Ending a Session

To gracefully end a realtime session, send `task_session_end`:

```json
{
  "type": "task_session_end",
  "tasktoken": "YOUR_SOCKET_ACCESS_TOKEN"
}
```

After sending this, the server will process any remaining audio, send final cost/transcript events, and then emit `task_end`. Wait for `task_end` before closing the WebSocket.

> **Safety:** If the client disconnects without sending `task_session_end`, the server automatically terminates the session to prevent the pipeline from running indefinitely (and the provider from continuing to charge). Always send `task_session_end` explicitly for a clean shutdown.

> **Insufficient balance:** If the wallet runs out of balance during a realtime session, the server automatically stops the session. You will still receive the final `task_cost` and `task_end` events.
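
The shutdown sequence can be wrapped in a promise that sends `task_session_end`, waits for `task_end`, and only then closes the socket. The sketch below adds a client-side timeout as a fallback. That timeout is an assumption of this example, since the docs do not state how long the server takes to finish, and `endSessionGracefully` is a hypothetical helper name. `ws` is any object with the standard WebSocket interface (`send`, `close`, `addEventListener`, `removeEventListener`).

```javascript
// Send task_session_end, wait for task_end, then close the socket.
// Resolves with 'task_end' or, if the server stays silent, 'timeout'.
function endSessionGracefully(ws, token, timeoutMs) {
  return new Promise(function (resolve) {
    var done = false;

    function finish(reason) {
      if (done) return;
      done = true;
      clearTimeout(timer);
      ws.removeEventListener('message', onMessage);
      ws.close();
      resolve(reason);
    }

    function onMessage(event) {
      if (typeof event.data !== 'string') return; // skip binary audio frames
      var msg;
      try { msg = JSON.parse(event.data); } catch (e) { return; }
      if (msg.type === 'task_end') finish('task_end');
    }

    // Fallback (assumption): give up waiting after timeoutMs.
    var timer = setTimeout(function () { finish('timeout'); }, timeoutMs);
    ws.addEventListener('message', onMessage);
    ws.send(JSON.stringify({ type: 'task_session_end', tasktoken: token }));
  });
}
```

A call such as `endSessionGracefully(ws, socketToken, 10000).then(console.log)` would log which path ended the session. Final `task_cost` and transcript events still reach any other `message` handlers registered on the socket.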

## Code Examples

### JavaScript

```javascript
// Realtime Voice Session — Connect and Handle Events

var socketToken = 'YOUR_SOCKET_ACCESS_TOKEN';
var ws = new WebSocket('wss://socket.wiro.ai/v1');

ws.onopen = function() {
  ws.send(JSON.stringify({ type: 'task_info', tasktoken: socketToken }));
};

ws.onmessage = function(event) {
  if (event.data instanceof Blob) {
    handleAudioResponse(event.data);
    return;
  }

  var msg = JSON.parse(event.data);

  if (msg.type === 'task_stream_ready') {
    console.log('Session ready — start microphone');
    startMicrophone(ws, socketToken);
  }

  if (msg.type === 'task_stream_end') {
    console.log('AI finished speaking');
  }

  if (msg.type === 'task_cost') {
    console.log('Turn cost:', msg.turnCost,
      'Total:', msg.cumulativeCost);
  }

  if (msg.type === 'task_output' &&
      typeof msg.message === 'string') {
    if (msg.message.startsWith('TRANSCRIPT_USER:')) {
      console.log('You:', msg.message.substring(16));
    }
    if (msg.message.startsWith('TRANSCRIPT_AI:')) {
      console.log('AI:', msg.message.substring(14));
    }
  }

  if (msg.type === 'task_end') {
    console.log('Session ended');
    stopMicrophone();
    ws.close();
  }
};

function endSession() {
  ws.send(JSON.stringify({
    type: 'task_session_end',
    tasktoken: socketToken
  }));
}
```

### Mic Capture

```javascript
// Microphone capture at 24kHz PCM Int16
// Binary frame: tasktoken|pcm_data

var audioCtx, workletNode, micStream;

async function startMicrophone(ws, token) {
  micStream = await navigator.mediaDevices
    .getUserMedia({
      audio: {
        echoCancellation: true,
        noiseSuppression: true
      }
    });

  audioCtx = new AudioContext({ sampleRate: 24000 });

  // Inline AudioWorklet processor
  var code = `
    class P extends AudioWorkletProcessor {
      constructor() { super(); this.buf = new Float32Array(0); }
      process(inputs) {
        var inp = inputs[0] && inputs[0][0];
        if (!inp) return true;
        var nb = new Float32Array(this.buf.length + inp.length);
        nb.set(this.buf);
        nb.set(inp, this.buf.length);
        this.buf = nb;
        while (this.buf.length >= 4800) {
          var c = this.buf.slice(0, 4800);
          this.buf = this.buf.slice(4800);
          var i16 = new Int16Array(c.length);
          for (var i = 0; i < c.length; i++) {
            var s = Math.max(-1, Math.min(1, c[i]));
            i16[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
          }
          this.port.postMessage(i16.buffer, [i16.buffer]);
        }
        return true;
      }
    }
    registerProcessor('p', P);
  `;
  var blob = new Blob([code], { type: 'application/javascript' });
  await audioCtx.audioWorklet.addModule(
    URL.createObjectURL(blob)
  );

  var src = audioCtx.createMediaStreamSource(micStream);
  workletNode = new AudioWorkletNode(audioCtx, 'p');
  src.connect(workletNode);

  // Encode the "token|" prefix once and reuse it for every frame
  var tokenBytes = new TextEncoder().encode(token + '|');

  workletNode.port.onmessage = function(e) {
    var frame = new Uint8Array(
      tokenBytes.length + e.data.byteLength
    );
    frame.set(tokenBytes, 0);
    frame.set(new Uint8Array(e.data), tokenBytes.length);
    if (ws.readyState === WebSocket.OPEN) ws.send(frame.buffer);
  };
}

function stopMicrophone() {
  if (workletNode) workletNode.disconnect();
  if (audioCtx) audioCtx.close();
  if (micStream)
    micStream.getTracks().forEach(t => t.stop());
}
```

### Play Audio

```javascript
// Receive and play AI audio (PCM Int16 24kHz)

var playCtx = new AudioContext({ sampleRate: 24000 });
var nextPlayTime = 0;

function handleAudioResponse(blob) {
  blob.arrayBuffer().then(function(buffer) {
    var bytes = new Uint8Array(buffer);

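    // Find the first pipe separator; PCM bytes after it may also contain 0x7C
    var pipeIndex = bytes.indexOf(0x7C);
    if (pipeIndex < 0) return;

    // slice() copies the audio bytes, so the Int16 view is 2-byte aligned
    var audioData = buffer.slice(pipeIndex + 1);
    var sampleCount = Math.floor(audioData.byteLength / 2);
    if (sampleCount === 0) return; // createBuffer rejects a zero length

    // Convert Int16 → Float32
    var int16 = new Int16Array(audioData, 0, sampleCount);
    var float32 = new Float32Array(int16.length);
    for (var i = 0; i < int16.length; i++) {
      float32[i] = int16[i] / 32768.0;
    }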

    // Schedule gapless playback
    var audioBuf = playCtx.createBuffer(
      1, float32.length, 24000
    );
    audioBuf.getChannelData(0).set(float32);

    var src = playCtx.createBufferSource();
    src.buffer = audioBuf;
    src.connect(playCtx.destination);

    var now = playCtx.currentTime;
    var t = Math.max(now, nextPlayTime);
    src.start(t);
    nextPlayTime = t + audioBuf.duration;
  });
}
```

### Python

```python
import asyncio
import json
import websockets
import pyaudio

SOCKET_TOKEN = 'YOUR_SOCKET_ACCESS_TOKEN'
SAMPLE_RATE = 24000
CHUNK = 4800  # 200 ms of audio (9,600 bytes)

async def realtime_session():
    uri = 'wss://socket.wiro.ai/v1'
    async with websockets.connect(uri) as ws:
        # Register
        await ws.send(json.dumps({
            'type': 'task_info',
            'tasktoken': SOCKET_TOKEN
        }))

        # Audio setup: 16-bit mono PCM at 24 kHz in both directions
        pa = pyaudio.PyAudio()
        mic = pa.open(format=pyaudio.paInt16,
            channels=1, rate=SAMPLE_RATE,
            input=True, frames_per_buffer=CHUNK)
        speaker = pa.open(format=pyaudio.paInt16,
            channels=1, rate=SAMPLE_RATE,
            output=True)

        prefix = SOCKET_TOKEN.encode() + b'|'
        is_listening = False
        running = True

        async def send_audio():
            while running:
                if not is_listening:
                    await asyncio.sleep(0.01)
                    continue
                # mic.read blocks for about 200 ms, so run it in a
                # worker thread to keep the event loop responsive
                data = await asyncio.to_thread(
                    mic.read, CHUNK,
                    exception_on_overflow=False)
                await ws.send(prefix + data)

        async def receive():
            nonlocal is_listening
            async for msg in ws:
                if isinstance(msg, bytes):
                    # Binary frame: tasktoken|pcm_data
                    pipe = msg.find(b'|')
                    if pipe >= 0:
                        await asyncio.to_thread(
                            speaker.write, msg[pipe + 1:])
                    continue
                data = json.loads(msg)
                t = data.get('type')
                if t == 'task_stream_ready':
                    print('Session ready')
                    is_listening = True
                elif t == 'task_stream_end':
                    print('AI finished speaking')
                elif t == 'task_cost':
                    print(f'Cost: {data["cumulativeCost"]}')
                elif t == 'task_output':
                    m = data.get('message', '')
                    if m.startswith('TRANSCRIPT_USER:'):
                        print(f'You: {m[16:]}')
                    elif m.startswith('TRANSCRIPT_AI:'):
                        print(f'AI: {m[14:]}')
                elif t in ('task_end',
                    'task_postprocess_end'):
                    print('Session ended')
                    break

        sender = asyncio.create_task(send_audio())
        try:
            await receive()
        finally:
            # Let the sender finish its current read, then release audio
            running = False
            await asyncio.gather(sender, return_exceptions=True)
            for stream in (mic, speaker):
                stream.stop_stream()
                stream.close()
            pa.terminate()

asyncio.run(realtime_session())
```

### Node.js

```javascript
const WebSocket = require('ws');

const SOCKET_TOKEN = 'YOUR_SOCKET_ACCESS_TOKEN';
const ws = new WebSocket('wss://socket.wiro.ai/v1');

ws.on('open', () => {
  ws.send(JSON.stringify({
    type: 'task_info',
    tasktoken: SOCKET_TOKEN
  }));
});

ws.on('message', (data, isBinary) => {
  if (isBinary) {
    const buf = Buffer.from(data);
    const pipe = buf.indexOf(0x7C);
    if (pipe > 0) {
      const audio = buf.subarray(pipe + 1);
      // Play via speaker or save to file
      console.log('Audio chunk:', audio.length, 'bytes');
    }
    return;
  }

  const msg = JSON.parse(data.toString());

  if (msg.type === 'task_stream_ready') {
    console.log('Session ready — start sending audio');
    // Start mic capture and send as binary
  }

  if (msg.type === 'task_cost') {
    console.log('Cost:', msg.cumulativeCost);
  }

  if (msg.type === 'task_output' &&
      typeof msg.message === 'string') {
    if (msg.message.startsWith('TRANSCRIPT_USER:'))
      console.log('You:', msg.message.substring(16));
    if (msg.message.startsWith('TRANSCRIPT_AI:'))
      console.log('AI:', msg.message.substring(14));
  }

  if (msg.type === 'task_end') {
    console.log('Done');
    ws.close();
  }
});

// End session
function endSession() {
  ws.send(JSON.stringify({
    type: 'task_session_end',
    tasktoken: SOCKET_TOKEN
  }));
}
```

### Format

```json
{
  "audio_format": {
    "codec": "PCM",
    "bit_depth": "16-bit Int16",
    "sample_rate": 24000,
    "channels": 1,
    "byte_order": "little-endian"
  },
  "binary_frame": "tasktoken|pcm_audio_data",
  "recommended_chunk": {
    "samples": 4800,
    "duration_ms": 200,
    "bytes": 9600
  },
  "events": {
    "task_stream_ready": "Start sending audio",
    "task_stream_end": "AI finished speaking",
    "task_cost": "Cost per turn + cumulative",
    "task_output": "TRANSCRIPT_USER: / TRANSCRIPT_AI:",
    "task_session_end": "Send to end session",
    "task_end": "Server ended session"
  }
}
```