VoxCPM is a text-to-speech model that can also do zero-shot voice cloning from a short reference clip. This review runs 6 tests on Wiro and shares the raw MP3 outputs.
Model link: https://wiro.ai/models/openbmb/voxcpm
What VoxCPM takes as input
- prompt: the text to speak
- cfgValue: guidance strength; higher values follow the text more closely, but can degrade audio quality
- inferenceSteps: higher can improve quality, but takes longer
- inputAudio + referencePrompt (optional): reference voice clip and its transcript for voice cloning
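As a rough sketch, the inputs above can be collected into one dict before a request is sent. The helper name `build_voxcpm_inputs` is hypothetical, and the exact request format Wiro expects is not shown in this review, so only the field names come from the list above. Requiring inputAudio and referencePrompt together is an assumption based on the list pairing them.

```python
from typing import Optional


def build_voxcpm_inputs(
    prompt: str,
    cfg_value: float = 2.0,
    inference_steps: int = 10,
    input_audio: Optional[str] = None,
    reference_prompt: Optional[str] = None,
) -> dict:
    """Assemble the VoxCPM inputs into one dict (hypothetical helper)."""
    if not prompt.strip():
        raise ValueError("prompt must contain the text to speak")
    if inference_steps < 1:
        raise ValueError("inferenceSteps must be at least 1")
    # Assumption: cloning needs the clip and its transcript together.
    if (input_audio is None) != (reference_prompt is None):
        raise ValueError("inputAudio and referencePrompt must be given together")

    inputs = {
        "prompt": prompt,
        "cfgValue": cfg_value,
        "inferenceSteps": inference_steps,
    }
    if input_audio is not None:
        inputs["inputAudio"] = input_audio
        inputs["referencePrompt"] = reference_prompt
    return inputs
```

The defaults of 2.0 and 10 mirror the settings used in most of the tests below; they are not documented defaults of the model.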
Test 1: Numbers, currency, tracking code
cfgValue=2.0, inferenceSteps=10
Takeaway: Short business text came out clear. Digits and decimals sounded stable.
Test 2: Calm narration
cfgValue=2.0, inferenceSteps=20
Takeaway: Longer sentences sounded smooth. The pacing did not collapse.
Test 3: Support message
cfgValue=2.0, inferenceSteps=10
Takeaway: Short sentence breaks helped the model keep a consistent tone.
Test 4: Fast ad read (speed stress)
cfgValue=2.3, inferenceSteps=5
Takeaway: Low steps ran fast, but the voice sounded more synthetic.
Test 5: Voice cloning from a clean reference clip
Reference input:
Clone output (cfgValue=2.0, inferenceSteps=10):
Takeaway: The output followed the reference voice's style more closely than the default-voice tests did.
Test 6: Voice cloning from a token-heavy reference clip
Reference input:
Clone output (cfgValue=2.0, inferenceSteps=10):
Takeaway: Token-like text remained difficult. Even with a matching reference style, URLs and spelled-out symbols need client-side normalization rules applied to the text before synthesis.
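One possible shape for such client-side rules is a small normalizer that rewrites URLs and underscore identifiers into plain words before the text is sent as the prompt. This is a minimal sketch, not something VoxCPM or Wiro provides: the function names, the symbol-to-word table, and the choice to drop the `https://` scheme are all assumptions, and query strings or other symbols would need extra rules.

```python
import re

# Hypothetical spoken forms; adjust for your own wording and language.
SYMBOL_WORDS = {
    "_": "underscore",
    "/": "slash",
    ".": "dot",
    "-": "dash",
    "@": "at",
    ":": "colon",
}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
IDENT_RE = re.compile(r"\b[A-Za-z0-9]+(?:_[A-Za-z0-9]+)+\b")


def speak_token(token: str) -> str:
    """Replace each known symbol in one token with its spoken word."""
    pieces = re.split(r"([_/.\-@:])", token)
    return " ".join(SYMBOL_WORDS.get(p, p) for p in pieces if p)


def _speak_url(match: re.Match) -> str:
    url = match.group(0)
    # Keep sentence punctuation that was glued to the end of the URL.
    trailing = ""
    while url and url[-1] in ".,;:!?)":
        trailing = url[-1] + trailing
        url = url[:-1]
    url = re.sub(r"^https?://", "", url)  # assumption: the scheme is not read aloud
    return speak_token(url) + trailing


def normalize_for_tts(text: str) -> str:
    """Rewrite URLs first, then underscore identifiers, into plain words."""
    text = URL_RE.sub(_speak_url, text)
    return IDENT_RE.sub(lambda m: speak_token(m.group(0)), text)


print(normalize_for_tts("Set user_id at https://wiro.ai/models/openbmb/voxcpm."))
```

Doing this on the client keeps the wording under your control, so the model only ever sees ordinary words.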
What VoxCPM did well
- Clean business narration with numbers and short sentences
- Voice cloning worked when a reference clip and its transcript were provided
Where it struggled
- Token-heavy text like URLs, underscores, and spelled-out symbols
- Very low inferenceSteps (5 in Test 4) traded quality for speed: the run was fast, but the voice sounded more synthetic
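For reuse, the settings from the six tests can be kept in a small table with a floor on inferenceSteps. This is only a sketch: the key names and the `settings_for` helper are made up for illustration, and the floor of 10 is an assumption drawn from these runs, where 10 steps sounded fine and 5 steps (with a higher cfgValue) sounded synthetic.

```python
# Settings copied from the six tests in this review; key names are made up.
TEST_SETTINGS = {
    "numbers": {"cfgValue": 2.0, "inferenceSteps": 10},
    "narration": {"cfgValue": 2.0, "inferenceSteps": 20},
    "support": {"cfgValue": 2.0, "inferenceSteps": 10},
    "fast_ad": {"cfgValue": 2.3, "inferenceSteps": 5},
    "clone_clean": {"cfgValue": 2.0, "inferenceSteps": 10},
    "clone_tokens": {"cfgValue": 2.0, "inferenceSteps": 10},
}


def settings_for(name: str, min_steps: int = 10) -> dict:
    """Return a copy of one test's settings, raising steps to min_steps if lower."""
    settings = dict(TEST_SETTINGS[name])  # copy so the table stays unchanged
    settings["inferenceSteps"] = max(settings["inferenceSteps"], min_steps)
    return settings


print(settings_for("fast_ad"))
```

Pass `min_steps=1` to get the original Test 4 settings back when speed matters more than quality.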