Nemotron vs Whisper: two very different ASR approaches
NVIDIA Nemotron-Speech-Streaming-En-0.6b targets low-latency streaming transcription (chunked audio) with punctuation and capitalization support. OpenAI Whisper Large V3 is a general-purpose speech recognition model trained at large scale and widely used for offline transcription.
This post runs both models on a small test set of five audio clips and compares the raw text outputs side by side.

What was tested
- Clean narration (basic accuracy)
- Punctuation and capitalization behavior
- Names and uncommon words (Vera, game-cock)
- Long sentence handling
- Customer support style numbers and phrasing
Inputs used
- Nemotron: inputAudio (audio URL)
- Whisper Large V3: inputAudioUrl, language=auto, maxNewTokens=256, chunkLength=30, batchSize=8, numSpeakers=1 (Whisper output includes timestamps/segments)
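As a rough sketch, the two input sets above can be written as plain Python dictionaries. The parameter names and values are taken from the list above. The value types, the example URL, and the helper names `nemotron_input` and `whisper_input` are assumptions for illustration. How the payload is actually submitted depends on the hosting platform, which is not shown here.

```python
def nemotron_input(audio_url: str) -> dict:
    """Input for the Nemotron run: only the audio URL."""
    return {"inputAudio": audio_url}


def whisper_input(audio_url: str) -> dict:
    """Input for the Whisper Large V3 run, using the values listed above."""
    return {
        "inputAudioUrl": audio_url,
        "language": "auto",    # assumed to be passed as a string
        "maxNewTokens": 256,
        "chunkLength": 30,
        "batchSize": 8,
        "numSpeakers": 1,
    }


# Hypothetical clip URL, for illustration only.
url = "https://example.com/test-01.wav"
print(nemotron_input(url))
print(whisper_input(url))
```

The asymmetry is worth noticing: Nemotron takes a single field, while the Whisper run carries chunking, batching, and speaker settings.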
Run-time snapshot (elapsed seconds)
| Test | Nemotron | Whisper Large V3 |
|---|---|---|
| 01 | 54 | 21 |
| 02 | 3 | 5 |
| 03 | 32 | 26 |
| 04 | 3 | 4 |
| 05 | 5 | 4 |
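To make the table easier to read at a glance, here is a minimal sketch that totals the elapsed times and lists which model finished first on each clip. The numbers are copied from the table above; the variable names are just for illustration.

```python
# Elapsed seconds copied from the table above: (Nemotron, Whisper Large V3).
runtimes = {
    "01": (54, 21),
    "02": (3, 5),
    "03": (32, 26),
    "04": (3, 4),
    "05": (5, 4),
}

nemotron_total = sum(n for n, _ in runtimes.values())
whisper_total = sum(w for _, w in runtimes.values())

nemotron_faster = [t for t, (n, w) in runtimes.items() if n < w]
whisper_faster = [t for t, (n, w) in runtimes.items() if w < n]

print(f"Nemotron total: {nemotron_total} s")   # 97 s
print(f"Whisper total:  {whisper_total} s")    # 60 s
print("Nemotron faster on:", nemotron_faster)  # ['02', '04']
print("Whisper faster on: ", whisper_faster)   # ['01', '03', '05']
```

These are single runs with no clip durations listed, so the totals are a snapshot and not a benchmark.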
Results: 5 audio clips with transcripts
| Test audio | Nemotron output | Whisper Large V3 output |
|---|---|---|
| Test 01 audio | persons who knows that they will not be able to rest along the way when they took a path will never get tired | 00:00.2 - 00:05.9 / Persons who knows that they will not be able to rest along the way when they took a path will never get tired. |
| Test 02 audio | going along slushy country roads and speaking to damp audiences in drafty schoolrooms day after day for a fortnight he'll have to put in an appearance at some place of worship on sunday morning and he can come to us immediately afterwards | 00:00.0 - 00:06.6 / going along slushy country roads and speaking to damp audiences in draughty schoolrooms day after day for a fortnight |
| Test 03 audio | before he had time to answer a much encumbered vera burst into the room with the question i say can i leave these here these were a small black pig and a lusty specimen of black red gamecock | 00:00.5 - 00:07.6 / before he had time to answer, a much-encumbered Vera burst into the room with the question, ìI say, can I leave these here?î |
| Test 04 audio | i received a birthday gift from a friend who sent it from afar that unexpected surprise and deep blessing filled my heart with sweet happiness and my smile bloomed like a flower | 00:00.0 - 00:03.3 / I received a birthday gift from a friend who sent it from afar. |
| Test 05 audio | i completely understand the frustration you're experiencing technical issues are never convenient to help me resolve this for you immediately could you please confirm the last four digits of your account number | 00:00.3 - 00:03.2 / I completely understand the frustration you're experiencing. |
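Because the two models format their output differently, a fair word-level comparison needs some normalization first. The sketch below parses a Whisper segment line in the `MM:SS.s - MM:SS.s / text` shape shown in the table, then lowercases and strips ASCII punctuation so the text can be compared with Nemotron's output. The helper names `parse_segment` and `normalize` are hypothetical and not part of either model's tooling, and the segment format is assumed from the rows above.

```python
import re
import string

# Matches lines such as "00:00.2 - 00:05.9 / Some text."
SEGMENT_RE = re.compile(r"^(\d+):(\d+(?:\.\d+)?) - (\d+):(\d+(?:\.\d+)?) / (.*)$")


def parse_segment(line: str) -> tuple[float, float, str]:
    """Split 'MM:SS.s - MM:SS.s / text' into (start_s, end_s, text)."""
    match = SEGMENT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized segment line: {line!r}")
    m1, s1, m2, s2, text = match.groups()
    return int(m1) * 60 + float(s1), int(m2) * 60 + float(s2), text


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


# Test 01 outputs, copied from the table above.
nemotron_01 = (
    "persons who knows that they will not be able to rest along the way "
    "when they took a path will never get tired"
)
whisper_01 = (
    "00:00.2 - 00:05.9 / Persons who knows that they will not be able to rest "
    "along the way when they took a path will never get tired."
)

start, end, whisper_text = parse_segment(whisper_01)
same_words = normalize(whisper_text) == normalize(nemotron_01)
print(start, end, same_words)  # 0.2 5.9 True
```

For Test 01 the two transcripts are word-for-word identical once case and punctuation are ignored. This normalization is deliberately simple: hyphenated forms such as much-encumbered, spelling variants such as drafty and draughty, and the non-ASCII quote characters in Test 03 would still show up as differences.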
Honest take
- Nemotron returned clean, single-line text for all five clips, but in these runs it lowercased everything and dropped punctuation, despite the model's stated punctuation and capitalization support.
- Whisper Large V3 returned segmented output with timestamps and generally kept punctuation and capitalization, but Test 03 shows odd characters (ì and î) where quotation marks would be expected.
- If you need streaming-first ASR behavior, Nemotron has the right shape. If you want timestamped segments out of the box, Whisper is convenient.