VibeVoice Realtime: real-time TTS in 6 tests
VibeVoice Realtime is a text-to-speech model that targets low-latency voice output and long-form stability. This post runs six short, practical prompts through it and publishes the raw MP3 outputs.
What was tested
- Short ad read (timing, emphasis)
- DevOps-style numbers and acronyms (UTC, v2, HTTP)
- Longer checklist paragraph (rhythm and breath)
- Meeting recap (prosody across sentences)
- German output with a German voice
- Tongue twisters (hard articulation)
Inputs used
The runs used only three inputs from the model docs: prompt, speakerName, and scale.
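As a minimal sketch, the three inputs can be assembled into one payload like this. The field names `prompt`, `speakerName`, and `scale` come from this post; the helper name `build_input`, the `scale` value of 1.0, and the flat JSON shape are assumptions for illustration, not the documented API.

```python
import json


def build_input(prompt: str, speaker_name: str, scale: float) -> dict:
    """Assemble the three inputs named in this post into one payload.

    The dict layout is an assumption; check the model docs for the real schema.
    """
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return {"prompt": prompt, "speakerName": speaker_name, "scale": scale}


# Test 01 from this post. The scale value 1.0 is a placeholder,
# not a documented default.
payload = build_input(
    "New drop. Stainless steel watch, matte black dial, 10 percent off today. "
    "Free shipping, delivery in 2 to 3 business days.",
    "en-emma_woman",
    1.0,
)
print(json.dumps(payload, indent=2))
```

Keeping the payload in one helper makes it easy to change only the speaker or only the scale between runs.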
Run-time snapshot
| Test | Speaker | Elapsed time (s) |
|---|---|---|
| 01 | en-emma_woman | 8 |
| 02 | en-davis_man | 10 |
| 03 | en-grace_woman | 13 |
| 04 | en-carter_man | 16 |
| 05 | de-spk1_woman | 10 |
| 06 | en-mike_man | 31 |
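For a quick read on these numbers, the short sketch below summarizes the elapsed times. The values are copied from the table; the variable names are illustrative only.

```python
from statistics import mean, median

# (test id, speaker, elapsed seconds) copied from the table in this post
runs = [
    ("01", "en-emma_woman", 8),
    ("02", "en-davis_man", 10),
    ("03", "en-grace_woman", 13),
    ("04", "en-carter_man", 16),
    ("05", "de-spk1_woman", 10),
    ("06", "en-mike_man", 31),
]

elapsed = [seconds for _, _, seconds in runs]
total = sum(elapsed)
avg = mean(elapsed)
med = median(elapsed)
slowest = max(runs, key=lambda run: run[2])

print(f"total={total}s mean={avg:.1f}s median={med}s slowest=test {slowest[0]}")
# prints: total=88s mean=14.7s median=11.5s slowest=test 06
```

The median (11.5 s) is lower than the mean (14.7 s) because test 06 alone accounts for 31 of the 88 seconds.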
Results: 6 prompts with audio
Test 01: short product ad read
Prompt:
New drop. Stainless steel watch, matte black dial, 10 percent off today. Free shipping, delivery in 2 to 3 business days.
Test 02: numbers, acronyms, and ops language
Prompt:
Deploy v2 at 14:05 UTC. Roll back if error rate exceeds 0.7 percent. Log the request id, the JSON payload size, and the HTTP status code.
Test 03: checklist pacing
Prompt:
Onboarding checklist. Step one, verify email. Step two, create an API key. Step three, run a smoke test with two prompts. Step four, set timeouts and retries. Step five, ship.
Test 04: sentence-level prosody
Prompt:
Meeting recap. First, the team agreed to cut the scope. Next, a quick demo shipped with a single button. Finally, a bug fix went out before lunch. Action items follow.
Test 05: German voice
Prompt:
Achtung. Bitte lesen Sie die Anleitung. Seriennummer DE 77 2048. Garantie 24 Monate. Bei Fragen, schreiben Sie dem Support.
Test 06: hard articulation
Prompt:
Hard test. She sells seashells by the seashore. Red leather, yellow leather. Unique New York. Say it three times, clearly.
Honest take
- The voice stays clear on short prompts, and the cadence sounds steady.
- Ops text works well when punctuation is explicit (commas and periods). Without it, acronyms can blur together.
- Speaker choice affects the perceived style more than scale does. Testing a few voices before shipping pays off.
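The last two points can be sketched as a small pre-flight step: tidy the punctuation, then build one payload per voice so the same prompt can be compared across speakers. The speaker names come from the run-time table in this post; the helper names, the `scale` value of 1.0, and the payload shape are assumptions, and no actual model call is made here.

```python
import re

# English speaker names taken from the run-time table in this post.
ENGLISH_SPEAKERS = [
    "en-emma_woman",
    "en-davis_man",
    "en-grace_woman",
    "en-carter_man",
    "en-mike_man",
]


def ensure_explicit_punctuation(text: str) -> str:
    """Collapse runs of whitespace and end the text with a sentence mark."""
    cleaned = re.sub(r"\s+", " ", text).strip()
    if cleaned and cleaned[-1] not in ".!?":
        cleaned += "."
    return cleaned


def voice_sweep(prompt: str, speakers: list[str], scale: float) -> list[dict]:
    """Build one payload per speaker, keeping prompt and scale fixed."""
    text = ensure_explicit_punctuation(prompt)
    return [{"prompt": text, "speakerName": name, "scale": scale} for name in speakers]


# Hypothetical input: part of the Test 02 prompt with a missing final period
# and a doubled space.
jobs = voice_sweep(
    "Deploy v2 at 14:05 UTC.  Roll back if error rate exceeds 0.7 percent",
    ENGLISH_SPEAKERS,
    1.0,  # placeholder value, not a documented default
)
for job in jobs:
    print(job["speakerName"], "->", job["prompt"])
```

Holding the prompt and scale constant while varying only the speaker keeps the comparison fair when listening to the outputs side by side.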