Speech-to-Text APIs in 2026: One Audio Clip, Two Modern Transcribers
This post tests two current speech-to-text APIs on Wiro using the same short MP3. The clip includes numbers and model names to stress common failure points.
Audio sample
Expected transcript (what the speaker says)
Hi. This is a 2026 speech to text benchmark on Wiro. It includes numbers like 3.5, 720p, and 1,024. Proper nouns: Kling, Seedance, PixVerse, Hailuo. End.
Models tested
- qwen/qwen3-asr-1.7b
- elevenlabs/speech-to-text
Results
qwen/qwen3-asr-1.7b
Elapsed processing time: 45s.
Language: English
Text: Hi, this is a 20th round six-page-to-text benchmark on Weiro. It includes numbers like 3.5, 720p, and 1024, proper nouns, hilling, students, pigs verse, hailuo, and.
elevenlabs/speech-to-text
Elapsed processing time: 4s.
Hi, this is a 20th drawn six speech-to-text benchmark on Weiro. It includes numbers like 3.5, 720p, and 1024; proper nouns, hyelin; sedents, pixvers, hyluo; end.
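To put a number on how far each output drifts from the expected transcript, one option is word error rate (WER): the word-level edit distance divided by the reference length. The sketch below is not part of the original test. The normalization rules (lowercasing, dropping thousands separators, splitting hyphens) and the function names are assumptions chosen for this clip, and a different normalization would give a different score.

```python
import re

EXPECTED = ("Hi. This is a 2026 speech to text benchmark on Wiro. It includes numbers "
            "like 3.5, 720p, and 1,024. Proper nouns: Kling, Seedance, PixVerse, Hailuo. End.")
QWEN = ("Hi, this is a 20th round six-page-to-text benchmark on Weiro. It includes numbers "
        "like 3.5, 720p, and 1024, proper nouns, hilling, students, pigs verse, hailuo, and.")
ELEVENLABS = ("Hi, this is a 20th drawn six speech-to-text benchmark on Weiro. It includes numbers "
              "like 3.5, 720p, and 1024; proper nouns, hyelin; sedents, pixvers, hyluo; end.")


def normalize(text: str) -> list[str]:
    """Assumed normalization: lowercase, 1,024 -> 1024, hyphens to spaces, strip punctuation."""
    text = text.lower()
    text = re.sub(r"(?<=\d),(?=\d)", "", text)   # remove thousands separators
    text = text.replace("-", " ")                # speech-to-text -> speech to text
    text = re.sub(r"[^\w\s.]", " ", text)        # keep '.' so 3.5 stays one token
    return [w.strip(".") for w in text.split() if w.strip(".")]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    prev = list(range(len(hyp) + 1))             # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    for name, hyp in [("qwen/qwen3-asr-1.7b", QWEN), ("elevenlabs/speech-to-text", ELEVENLABS)]:
        print(f"{name}: WER = {word_error_rate(EXPECTED, hyp):.2f}")
```

WER weights every word equally, so a misheard "Kling" costs the same as a misheard "and". On a clip this short, read it as a rough signal rather than a ranking.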
Quick comparison table
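| Model | Elapsed time (s) | What to watch for |
|---|---|---|
| Qwen3 ASR 1.7B | 45 | Numbers and punctuation vs. proper nouns: 3.5, 720p, and 1024 are correct, but "2026", "speech", and "Wiro" are misheard, and only Hailuo survives among the four names |
| ElevenLabs STT | 4 | Speed vs. name accuracy: about 11x faster, with numbers correct, but "2026" and "Wiro" are misheard and none of the four names is spelled correctly |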
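
The clip was built around a handful of stress terms, so a per-term check can be more telling than an overall score. The following is a minimal sketch, assuming an exact, case-insensitive token match counts as a hit. The term list and helper names are illustrative choices, not something the APIs return.

```python
import re

# Stress terms taken from the expected transcript.
KEY_TERMS = ["2026", "Wiro", "3.5", "720p", "1024", "Kling", "Seedance", "PixVerse", "Hailuo"]

QWEN = ("Hi, this is a 20th round six-page-to-text benchmark on Weiro. It includes numbers "
        "like 3.5, 720p, and 1024, proper nouns, hilling, students, pigs verse, hailuo, and.")
ELEVENLABS = ("Hi, this is a 20th drawn six speech-to-text benchmark on Weiro. It includes numbers "
              "like 3.5, 720p, and 1024; proper nouns, hyelin; sedents, pixvers, hyluo; end.")


def tokens(text: str) -> set[str]:
    """Lowercase, drop thousands separators, and trim edge punctuation from each word."""
    text = re.sub(r"(?<=\d),(?=\d)", "", text.lower())
    return {w.strip(".,;:") for w in text.split()}


def term_hits(transcript: str, terms: list[str] = KEY_TERMS) -> dict[str, bool]:
    """Map each stress term to True if it appears as an exact token in the transcript."""
    found = tokens(transcript)
    return {term: term.lower() in found for term in terms}


if __name__ == "__main__":
    for name, hyp in [("qwen/qwen3-asr-1.7b", QWEN), ("elevenlabs/speech-to-text", ELEVENLABS)]:
        hits = term_hits(hyp)
        missed = [t for t, ok in hits.items() if not ok]
        print(f"{name}: {sum(hits.values())}/{len(hits)} terms found, missed: {', '.join(missed)}")
```

Exact matching is deliberately strict: a near miss such as "pixvers" counts as a failure, which matters if the transcript feeds a search index or a lookup keyed on the product name.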