{"id":1248,"date":"2026-03-02T23:00:08","date_gmt":"2026-03-02T23:00:08","guid":{"rendered":"https:\/\/wiro.ai\/blog\/?p=1248"},"modified":"2026-02-25T23:09:38","modified_gmt":"2026-02-25T23:09:38","slug":"voxcpm-voice-cloning-and-tts-in-6-tests","status":"publish","type":"post","link":"https:\/\/wiro.ai\/blog\/voxcpm-voice-cloning-and-tts-in-6-tests\/","title":{"rendered":"VoxCPM: Voice Cloning and TTS in 6 Tests"},"content":{"rendered":"<p>VoxCPM is a text-to-speech model that can also do zero-shot voice cloning from a short reference clip. This review runs 6 tests on Wiro and shares the raw MP3 outputs.<\/p>\n<p>Model link: <a href=\"https:\/\/wiro.ai\/models\/openbmb\/voxcpm\">https:\/\/wiro.ai\/models\/openbmb\/voxcpm<\/a><\/p>\n<h2>What VoxCPM takes as input<\/h2>\n<ul>\n<li><strong>prompt<\/strong>: the text to speak<\/li>\n<li><strong>cfgValue<\/strong>: higher values follow the text more closely but can degrade audio quality<\/li>\n<li><strong>inferenceSteps<\/strong>: higher values can improve quality but take longer to generate<\/li>\n<li><strong>inputAudio<\/strong> + <strong>referencePrompt<\/strong> (optional): a reference voice clip and its transcript, used for voice cloning<\/li>\n<\/ul>\n<h2>Test 1: Numbers, currency, tracking code<\/h2>\n<p>cfgValue=2.0, inferenceSteps=10<\/p>\n<figure>\n  <audio controls preload=\"metadata\" src=\"https:\/\/wiro.ai\/blog\/wp-content\/uploads\/2026\/02\/voxcpm-01-order-confirmation.mp3\"><\/audio><figcaption>Prompt: Your order 51723 is confirmed. Total: 1249.90 TL. Delivery window: 2 to 3 business days. Tracking: TR-508-AB.<\/figcaption><\/figure>\n<p>Takeaway: Short business text came out clear. Digits and decimals sounded stable.<\/p>\n<h2>Test 2: Calm narration<\/h2>\n<p>cfgValue=2.0, inferenceSteps=20<\/p>\n<figure>\n  <audio controls preload=\"metadata\" src=\"https:\/\/wiro.ai\/blog\/wp-content\/uploads\/2026\/02\/voxcpm-02-narration.mp3\"><\/audio><figcaption>Prompt: The street is quiet after midnight. A tram passes and the sound fades into the rain. 
The cafe sign flickers once, then holds steady.<\/figcaption><\/figure>\n<p>Takeaway: The longer sentences sounded smooth, and the pacing held steady throughout.<\/p>\n<h2>Test 3: Support message<\/h2>\n<p>cfgValue=2.0, inferenceSteps=10<\/p>\n<figure>\n  <audio controls preload=\"metadata\" src=\"https:\/\/wiro.ai\/blog\/wp-content\/uploads\/2026\/02\/voxcpm-03-support-message.mp3\"><\/audio><figcaption>Prompt: Hi. This is support. The reset link expires in 15 minutes. Do not share the code. If this was not requested, ignore this message.<\/figcaption><\/figure>\n<p>Takeaway: Short sentence breaks helped the model keep a consistent tone.<\/p>\n<h2>Test 4: Fast ad read (speed stress)<\/h2>\n<p>cfgValue=2.3, inferenceSteps=5<\/p>\n<figure>\n  <audio controls preload=\"metadata\" src=\"https:\/\/wiro.ai\/blog\/wp-content\/uploads\/2026\/02\/voxcpm-04-ad-read.mp3\"><\/audio><figcaption>Prompt: New drop. Same price. Faster shipping. Add to cart and check out in under 30 seconds.<\/figcaption><\/figure>\n<p>Takeaway: At 5 inference steps, generation was fast, but the voice sounded more synthetic.<\/p>\n<h2>Test 5: Voice cloning from a clean reference clip<\/h2>\n<p>Reference input:<\/p>\n<figure>\n  <audio controls preload=\"metadata\" src=\"https:\/\/wiro.ai\/blog\/wp-content\/uploads\/2026\/02\/qwen3-asr-test-01-en-clean.mp3\"><\/audio><figcaption>Reference transcript: For the shipping audit, order 48219 shipped on February 14 at 9:05 AM. Total weight 3.7 kilograms. Tracking code Z X dash 9 1 dash Delta.<\/figcaption><\/figure>\n<p>Clone output (cfgValue=2.0, inferenceSteps=10):<\/p>\n<figure>\n  <audio controls preload=\"metadata\" src=\"https:\/\/wiro.ai\/blog\/wp-content\/uploads\/2026\/02\/voxcpm-05-voice-clone-clean.mp3\"><\/audio><figcaption>Prompt: Voice clone test. Ticket 77104 closed at 18:30. Refund amount 79.90 TL. 
Please reply with the last four digits of the card.<\/figcaption><\/figure>\n<p>Takeaway: The cloned output sounded closer to the reference voice than the default voice used in Tests 1 to 4.<\/p>\n<h2>Test 6: Voice cloning from a token-heavy reference clip<\/h2>\n<p>Reference input:<\/p>\n<figure>\n  <audio controls preload=\"metadata\" src=\"https:\/\/wiro.ai\/blog\/wp-content\/uploads\/2026\/02\/qwen3-asr-test-02-en-tokens.mp3\"><\/audio><figcaption>Reference transcript: Email support plus wiro at acme dot dev. URL https colon slash slash api dot example dot com slash v1 slash run question mark mode equals fast ampersand retry equals 2. Error code E underscore C O N N underscore R E S E T. Commit seven f three a nine c one.<\/figcaption><\/figure>\n<p>Clone output (cfgValue=2.0, inferenceSteps=10):<\/p>\n<figure>\n  <audio controls preload=\"metadata\" src=\"https:\/\/wiro.ai\/blog\/wp-content\/uploads\/2026\/02\/voxcpm-06-voice-clone-tokens.mp3\"><\/audio><figcaption>Prompt: Second clone test. Open https colon slash slash status dot example dot com. If error code E underscore T I M E O U T appears, retry twice.<\/figcaption><\/figure>\n<p>Takeaway: Token-like text remained difficult. Even with a reference clip in a matching style, URLs and spelled-out symbols still need client-side normalization rules.<\/p>\n<h2>What VoxCPM did well<\/h2>\n<ul>\n<li>Clean business narration with numbers and short sentences<\/li>\n<li>Voice cloning when both a reference clip and its transcript were provided<\/li>\n<\/ul>\n<h2>Where it struggled<\/h2>\n<ul>\n<li>Token-heavy text like URLs, underscores, and spelled-out symbols<\/li>\n<li>Quality dropped quickly at very low inferenceSteps (5 in Test 4)<\/li>\n<\/ul>\n<h2>Try it<\/h2>\n<p><a href=\"https:\/\/wiro.ai\/models\/openbmb\/voxcpm\">VoxCPM on Wiro<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>VoxCPM is a text-to-speech model that can also do zero-shot voice cloning from a short reference clip. 
This review runs 6 tests&hellip;<\/p>\n","protected":false},"author":4,"featured_media":1247,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[52],"tags":[94,105,62,68,104],"class_list":["post-1248","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-model-reviews","tag-audio","tag-openbmb","tag-text-to-speech","tag-voice-clone","tag-voxcpm"],"_links":{"self":[{"href":"https:\/\/wiro.ai\/blog\/wp-json\/wp\/v2\/posts\/1248","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wiro.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wiro.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/wiro.ai\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/wiro.ai\/blog\/wp-json\/wp\/v2\/comments?post=1248"}],"version-history":[{"count":1,"href":"https:\/\/wiro.ai\/blog\/wp-json\/wp\/v2\/posts\/1248\/revisions"}],"predecessor-version":[{"id":1249,"href":"https:\/\/wiro.ai\/blog\/wp-json\/wp\/v2\/posts\/1248\/revisions\/1249"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wiro.ai\/blog\/wp-json\/wp\/v2\/media\/1247"}],"wp:attachment":[{"href":"https:\/\/wiro.ai\/blog\/wp-json\/wp\/v2\/media?parent=1248"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wiro.ai\/blog\/wp-json\/wp\/v2\/categories?post=1248"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wiro.ai\/blog\/wp-json\/wp\/v2\/tags?post=1248"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}