Nemotron vs Whisper Large V3: 5 Audio Transcription Tests

Nemotron vs Whisper: two very different ASR approaches

NVIDIA Nemotron-Speech-Streaming-En-0.6b targets low-latency streaming transcription (chunked audio) with punctuation and capitalization support. OpenAI Whisper Large V3 is a general-purpose speech recognition model trained at large scale and widely used for offline transcription.

This post runs a small five-clip test set and compares the raw text outputs side by side.

What was tested

  • Clean narration (basic accuracy)
  • Punctuation and capitalization behavior
  • Names and uncommon words (Vera, game-cock)
  • Long sentence handling
  • Customer support style numbers and phrasing

Inputs used

  • Nemotron: inputAudio (audio URL)
  • Whisper Large V3: inputAudioUrl, language=auto, maxNewTokens=256, chunkLength=30, batchSize=8, numSpeakers=1 (Whisper output includes timestamps/segments)
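
For reference, the two input sets above can be written out as plain payloads. This is only a sketch: the parameter names and values come from the list above, but the audio URL is a placeholder, the value types (numbers rather than strings) are an assumption, and how a payload is actually submitted depends on the hosting platform, which this post does not specify.

```python
import json

# Placeholder URL (assumption); the post does not list the real clip locations.
AUDIO_URL = "https://example.com/test-01.wav"

# Nemotron takes only the audio URL in this comparison.
nemotron_input = {"inputAudio": AUDIO_URL}

# Whisper Large V3 settings as listed above; numeric types are assumed.
whisper_input = {
    "inputAudioUrl": AUDIO_URL,
    "language": "auto",
    "maxNewTokens": 256,
    "chunkLength": 30,
    "batchSize": 8,
    "numSpeakers": 1,
}

print(json.dumps(nemotron_input, indent=2))
print(json.dumps(whisper_input, indent=2))
```

Note that the two models use different key names for the same audio URL (inputAudio vs inputAudioUrl), so the payloads are not interchangeable.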

Run-time snapshot (elapsed seconds)

Test    Nemotron (s)    Whisper Large V3 (s)
01      54              21
02      3               5
03      32              26
04      3               4
05      5               4
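
To make the snapshot easier to read, the numbers can be totalled with a few lines of Python. The values are copied from the table above; the variable names are just for illustration.

```python
# Elapsed seconds per test, copied from the table: (Nemotron, Whisper Large V3).
elapsed = {
    "01": (54, 21),
    "02": (3, 5),
    "03": (32, 26),
    "04": (3, 4),
    "05": (5, 4),
}

nemotron_total = sum(n for n, _ in elapsed.values())
whisper_total = sum(w for _, w in elapsed.values())

# Tests where Nemotron finished strictly faster than Whisper.
nemotron_faster = [test for test, (n, w) in elapsed.items() if n < w]

print(nemotron_total, whisper_total)  # 97 60
print(nemotron_faster)                # ['02', '04']
```

The totals are dominated by tests 01 and 03. The post does not say whether elapsed time includes startup or queueing, so these figures are best read as a snapshot of single runs and not as a benchmark.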

Results: 5 audio clips with transcripts

Test 01
  Nemotron: persons who knows that they will not be able to rest along the way when they took a path will never get tired
  Whisper Large V3:
    00:00.2 - 00:05.9 / Persons who knows that they will not be able to rest along the way when they took a path will never get tired.

Test 02
  Nemotron: going along slushy country roads and speaking to damp audiences in drafty schoolrooms day after day for a fortnight he'll have to put in an appearance at some place of worship on sunday morning and he can come to us immediately afterwards
  Whisper Large V3:
    00:00.0 - 00:06.6 / going along slushy country roads and speaking to damp audiences in draughty schoolrooms day after day for a fortnight
    00:07.3 - 00:13.5 / He'll have to put in an appearance at some place of worship on Sunday morning, and he can come to us immediately afterwards.

Test 03
  Nemotron: before he had time to answer a much encumbered vera burst into the room with the question i say can i leave these here these were a small black pig and a lusty specimen of black red gamecock
  Whisper Large V3:
    00:00.5 - 00:07.6 / before he had time to answer, a much-encumbered Vera burst into the room with the question, ìI say, can I leave these here?î
    00:08.5 - 00:13.7 / These were a small black pig and a lusty specimen of black-red game-cock,

Test 04
  Nemotron: i received a birthday gift from a friend who sent it from afar that unexpected surprise and deep blessing filled my heart with sweet happiness and my smile bloomed like a flower
  Whisper Large V3:
    00:00.0 - 00:03.3 / I received a birthday gift from a friend who sent it from afar.
    00:03.9 - 00:05.7 / that unexpected surprise
    00:06.0 - 00:09.0 / and deep blessing filled my heart with sweet happiness.
    00:09.5 - 00:11.4 / and my smile bloomed like a flower.

Test 05
  Nemotron: i completely understand the frustration you're experiencing technical issues are never convenient to help me resolve this for you immediately could you please confirm the last four digits of your account number
  Whisper Large V3:
    00:00.3 - 00:03.2 / I completely understand the frustration you're experiencing.
    00:03.6 - 00:05.6 / technical issues are never convenient.
    00:06.0 - 00:08.0 / to help me resolve this for you immediately.
    00:08.5 - 00:11.5 / Could you please confirm the last four digits of your account number?
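
Because the two models differ so much in casing and punctuation, it helps to normalize both outputs before comparing the words themselves. The sketch below is one possible approach, not the method used to produce this post: it lowercases, strips punctuation (keeping apostrophes), and uses Python's standard difflib to list word-level differences. The helper names normalize and word_diffs are made up for illustration.

```python
import difflib
import re


def normalize(text: str) -> list[str]:
    """Lowercase, turn hyphens into spaces, drop other punctuation, split into words."""
    text = text.lower().replace("-", " ")
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return text.split()


def word_diffs(a: str, b: str) -> list[tuple[str, str]]:
    """Return (words_in_a, words_in_b) pairs wherever the normalized texts differ."""
    wa, wb = normalize(a), normalize(b)
    matcher = difflib.SequenceMatcher(a=wa, b=wb, autojunk=False)
    return [
        (" ".join(wa[i1:i2]), " ".join(wb[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Test 04: Nemotron output vs Whisper segments joined with spaces.
nemotron_04 = (
    "i received a birthday gift from a friend who sent it from afar "
    "that unexpected surprise and deep blessing filled my heart with "
    "sweet happiness and my smile bloomed like a flower"
)
whisper_04 = " ".join([
    "I received a birthday gift from a friend who sent it from afar.",
    "that unexpected surprise",
    "and deep blessing filled my heart with sweet happiness.",
    "and my smile bloomed like a flower.",
])

print(word_diffs(nemotron_04, whisper_04))  # []

# Excerpt from Test 02, where the spelling differs.
print(word_diffs("audiences in drafty schoolrooms",
                 "audiences in draughty schoolrooms"))  # [('drafty', 'draughty')]
```

With this normalization, Test 04 comes out word-for-word identical, while Test 02 differs only in spelling (drafty vs draughty). Compound words are still sensitive to the rule chosen: treating hyphens as spaces makes Whisper's "game-cock" two words against Nemotron's single "gamecock".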

Honest take

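  • Nemotron returned clean, single-line text for all five clips, but in these runs everything came back lowercase with no sentence punctuation (only apostrophes, as in "he'll" and "you're", survived), despite the model's stated punctuation and capitalization support.
  • Whisper Large V3 returned segmented output with timestamps and generally kept punctuation and capitalization, but Test 03 shows odd quote characters (ì and î) in the text output, most likely mis-encoded curly quotation marks.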
  • If you need streaming-first ASR behavior, Nemotron has the right shape. If you want timestamped segments out of the box, Whisper is convenient.
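
If you want to work with the timestamped segments programmatically, lines in the "start - end / text" layout shown in the results can be parsed as follows. This is a sketch that assumes the display format used in this post (mm:ss.s); it is not a description of Whisper's native output structure, and the Segment type and parse_segment helper are hypothetical names.

```python
import re
from typing import NamedTuple


class Segment(NamedTuple):
    start: float  # seconds from the start of the clip
    end: float    # seconds from the start of the clip
    text: str


# Matches lines such as "00:03.9 - 00:05.7 / that unexpected surprise".
SEGMENT_RE = re.compile(
    r"^(\d+):(\d+(?:\.\d+)?) - (\d+):(\d+(?:\.\d+)?) / (.*)$"
)


def parse_segment(line: str) -> Segment:
    """Convert one 'mm:ss.s - mm:ss.s / text' line into a Segment."""
    match = SEGMENT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a segment line: {line!r}")
    start_min, start_sec, end_min, end_sec, text = match.groups()
    start = int(start_min) * 60 + float(start_sec)
    end = int(end_min) * 60 + float(end_sec)
    return Segment(start, end, text)


segment = parse_segment("00:03.9 - 00:05.7 / that unexpected surprise")
print(segment.text, round(segment.end - segment.start, 1))
```

Joining the text fields of consecutive segments with spaces gives a single-line transcript comparable to Nemotron's output.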
