Nemotron vs Whisper Large V3: 5 Audio Transcription Tests

Nemotron vs Whisper: two very different ASR approaches

NVIDIA Nemotron-Speech-Streaming-En-0.6b targets low-latency streaming transcription (chunked audio) with punctuation and capitalization support. OpenAI Whisper Large V3 is a general-purpose speech recognition model trained at large scale and widely used for offline transcription.

This post runs a small 5-audio test set and compares the raw text outputs side by side.

Microphone and audio waveform illustration

Model links

What was tested

Clean narration (basic accuracy)
Punctuation and capitalization behavior
Names and uncommon words (Vera, game-cock)
Long sentence handling
Customer support style numbers and phrasing

Inputs used

Nemotron: inputAudio (audio URL)
Whisper Large V3: inputAudioUrl, language=auto, maxNewTokens=256, chunkLength=30, batchSize=8, numSpeakers=1 (Whisper output includes timestamps/segments)
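As a rough sketch, the two input sets above can be written as plain request payloads. The field names and values come from the list above; the audio URL, the helper function names, and the value types (integers versus strings) are assumptions, since the post does not show the actual request format or endpoint.

```python
import json

# Hypothetical placeholder: the post does not list the clip URLs.
AUDIO_URL = "https://example.com/test-01.wav"


def nemotron_input(audio_url: str) -> dict:
    """Nemotron takes a single field: the audio URL."""
    return {"inputAudio": audio_url}


def whisper_input(audio_url: str) -> dict:
    """Whisper Large V3 settings as listed in the post (types assumed)."""
    return {
        "inputAudioUrl": audio_url,
        "language": "auto",
        "maxNewTokens": 256,
        "chunkLength": 30,
        "batchSize": 8,
        "numSpeakers": 1,
    }


print(json.dumps(nemotron_input(AUDIO_URL), indent=2))
print(json.dumps(whisper_input(AUDIO_URL), indent=2))
```

Keeping both payloads in one place makes it easier to rerun all five clips with identical settings.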

Run-time snapshot (elapsed seconds)

Test	Nemotron	Whisper Large V3
01	54	21
02	3	5
03	32	26
04	3	4
05	5	4
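
To read the table at a glance, here is a small sketch that totals the elapsed seconds and takes the median per model. The numbers are copied from the table above; the variable names are illustrative only.

```python
from statistics import median

# Elapsed seconds per test: (Nemotron, Whisper Large V3), from the table above.
elapsed = {
    "01": (54, 21),
    "02": (3, 5),
    "03": (32, 26),
    "04": (3, 4),
    "05": (5, 4),
}

nemotron = [n for n, _ in elapsed.values()]
whisper = [w for _, w in elapsed.values()]

print("total :", sum(nemotron), sum(whisper))        # total : 97 60
print("median:", median(nemotron), median(whisper))  # median: 5 5
```

The totals differ (97 vs 60 seconds) while the medians match (5 seconds each), so the gap comes from the two long runs, Tests 01 and 03. The post does not say what caused those longer runs, and five single runs are too few to treat as a benchmark.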

Results: 5 audio clips with transcripts

Test audio	Nemotron output	Whisper Large V3 output
Test 01 audio	`persons who knows that they will not be able to rest along the way when they took a path will never get tired`	`00:00.2 - 00:05.9 / Persons who knows that they will not be able to rest along the way when they took a path will never get tired.`
Test 02 audio	`going along slushy country roads and speaking to damp audiences in drafty schoolrooms day after day for a fortnight he'll have to put in an appearance at some place of worship on sunday morning and he can come to us immediately afterwards`	`00:00.0 - 00:06.6 / going along slushy country roads and speaking to damp audiences in draughty schoolrooms day after day for a fortnight 00:07.3 - 00:13.5 / He'll have to put in an appearance at some place of worship on Sunday morning, and he can come to us immediately afterwards.`
Test 03 audio	`before he had time to answer a much encumbered vera burst into the room with the question i say can i leave these here these were a small black pig and a lusty specimen of black red gamecock`	`00:00.5 - 00:07.6 / before he had time to answer, a much-encumbered Vera burst into the room with the question, ìI say, can I leave these here?î 00:08.5 - 00:13.7 / These were a small black pig and a lusty specimen of black-red game-cock,`
Test 04 audio	`i received a birthday gift from a friend who sent it from afar that unexpected surprise and deep blessing filled my heart with sweet happiness and my smile bloomed like a flower`	`00:00.0 - 00:03.3 / I received a birthday gift from a friend who sent it from afar. 00:03.9 - 00:05.7 / that unexpected surprise 00:06.0 - 00:09.0 / and deep blessing filled my heart with sweet happiness. 00:09.5 - 00:11.4 / and my smile bloomed like a flower.`
Test 05 audio	`i completely understand the frustration you're experiencing technical issues are never convenient to help me resolve this for you immediately could you please confirm the last four digits of your account number`	`00:00.3 - 00:03.2 / I completely understand the frustration you're experiencing. 00:03.6 - 00:05.6 / technical issues are never convenient. 00:06.0 - 00:08.0 / to help me resolve this for you immediately. 00:08.5 - 00:11.5 / Could you please confirm the last four digits of your account number?`
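
Because the two outputs differ in casing, punctuation, and timestamps, a word-level comparison is easier after normalizing both. The sketch below strips Whisper's `MM:SS.s - MM:SS.s /` segment markers, lowercases, drops punctuation other than apostrophes, and lists the words that differ. The function names are made up for this example, and the timestamp pattern is inferred from the outputs shown in the table, not from any documented format.

```python
import difflib
import re

# Matches segment markers such as "00:07.3 - 00:13.5 / " (format inferred from the table).
SEGMENT = re.compile(r"\d{2}:\d{2}\.\d - \d{2}:\d{2}\.\d / ")


def strip_timestamps(whisper_text: str) -> str:
    """Remove Whisper's segment markers, leaving the spoken text."""
    return SEGMENT.sub("", whisper_text)


def normalize(text: str) -> list[str]:
    """Lowercase and split into words, keeping only letters, digits, apostrophes."""
    return re.sub(r"[^a-z0-9']+", " ", text.lower()).split()


def word_diffs(a: list[str], b: list[str]) -> list[tuple[list[str], list[str]]]:
    """Return the word runs that differ between the two transcripts."""
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


# Test 02 outputs, copied from the table above.
nemotron = ("going along slushy country roads and speaking to damp audiences in "
            "drafty schoolrooms day after day for a fortnight he'll have to put in "
            "an appearance at some place of worship on sunday morning and he can "
            "come to us immediately afterwards")
whisper = ("00:00.0 - 00:06.6 / going along slushy country roads and speaking to "
           "damp audiences in draughty schoolrooms day after day for a fortnight "
           "00:07.3 - 00:13.5 / He'll have to put in an appearance at some place of "
           "worship on Sunday morning, and he can come to us immediately afterwards.")

print(word_diffs(normalize(nemotron), normalize(strip_timestamps(whisper))))
# [(['drafty'], ['draughty'])]
```

On Test 02 the only word-level difference is a spelling variant (drafty vs draughty). Hyphens are treated as word breaks here, so a compound like game-cock in Test 03 would still show up as a difference against gamecock.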

Honest take

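Nemotron returned clean, single-line text for all five clips, but everything came back lowercase with no sentence punctuation, despite the model's stated punctuation and capitalization support. Only apostrophes in contractions (he'll, you're) survived.
Whisper Large V3 returned timestamped segments and mostly kept punctuation and capitalization, but both are inconsistent at segment boundaries: in Tests 04 and 05 some segments start in lowercase and a period lands mid-sentence. One clip (Test 03) also shows odd quote characters (ì and î) in the text output.
If you need streaming-first ASR behavior, Nemotron is built for that (low-latency, chunked audio). If you want timestamped segments out of the box, Whisper is convenient.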
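
The odd ì and î characters in the Test 03 Whisper output sit exactly where opening and closing quotation marks would go. One plausible explanation, not confirmed anywhere in the post, is an encoding mix-up: byte 0x93 is a left curly quote in Windows-1252 but ì in Mac OS Roman, and 0x94 is a right curly quote in one and î in the other. Under that assumption, a minimal cleanup sketch looks like this (the function name is made up for the example):

```python
# Assumed mapping: ì and î stand in for curly double quotes (see the lead-in).
QUOTE_FIX = str.maketrans({"ì": "\u201c", "î": "\u201d"})


def repair_quotes(text: str) -> str:
    """Replace the two stray characters with curly double quotes."""
    return text.translate(QUOTE_FIX)


# Fragment copied from the Test 03 Whisper output.
garbled = "ìI say, can I leave these here?î"
print(repair_quotes(garbled))
```

This is only safe for English transcripts like the ones here. Text that legitimately contains ì or î (Italian or French, for example) would be damaged by the same replacement.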
