MOSS-TTSD turns a dialogue script into spoken conversation. This post runs six short tests and shares the raw audio outputs. The goal is to check turn-taking, timing, and tone changes across speakers.
Test rules
- Input format: a single dialogue string with speaker tags like [S1] and [S2]
- No reference audio used in these runs
- Outputs published as-is
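The input format described above can be sketched in a few lines of Python. This is only an illustration of the tagged-string convention, not MOSS-TTSD's own API: `build_script` and `parse_script` are hypothetical helper names, and the only thing taken from the tests is the `[S1]` / `[S2]` tag style.

```python
import re

# Hypothetical helpers; only the [S1]/[S2] tag convention comes from the post.
TAG_PATTERN = re.compile(r"\[(S\d+)\]")


def build_script(turns):
    """Join (speaker, text) pairs into one tagged dialogue string."""
    return " ".join(f"[{speaker}] {text.strip()}" for speaker, text in turns)


def parse_script(script):
    """Split a tagged dialogue string back into (speaker, text) pairs."""
    parts = TAG_PATTERN.split(script)
    # parts[0] is whatever precedes the first tag; the rest alternates
    # speaker, text, speaker, text, ...
    return [
        (speaker, text.strip())
        for speaker, text in zip(parts[1::2], parts[2::2])
    ]


turns = [
    ("S1", "Morning. The numbers from yesterday look off."),
    ("S2", "Yep. The export rounded decimals."),
]
script = build_script(turns)
print(script)
print(parse_script(script) == turns)  # True
```

Keeping the script as a list of turns until the last moment makes it easier to edit one line without breaking a tag by hand.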

Results (6 tests)
Test 1: office back and forth
Dialogue:
[S1] Morning. The numbers from yesterday look off. [S2] Yep. The export rounded decimals. [S1] Fix it and resend in ten minutes. [S2] On it.
Quick take: short turns sound clean. Speaker switches stay obvious.
Test 2: podcast intro pacing
Dialogue:
[S1] Welcome back to the show. Today: why latency matters. [S2] And why everyone notices bad timing. [S1] First question. What makes a voice feel real. [S2] Pauses, breaths, and turn taking.
Quick take: the pauses between turns help. The rhythm feels closer to conversation than to a single long read.
Test 3: sports commentary energy
Dialogue:
[S1] Goal. Goal. Listen to the crowd. [S2] The pass was perfect. [S1] The striker did not hesitate. [S2] Replay it. Slow. The timing is everything.
Quick take: excitement shows up through tempo. This is useful for highlight narration.
Test 4: code switch lines
Dialogue:
[S1] Quick check. Are we live. [S2] Yes. Ses iyi mi. [S1] Great. Start with the headline. [S2] Tamam. Today the update ships at noon.
Quick take: mixed-language scripts are a good stress test. Pronunciation and cadence need spot checks.
Test 5: emotional tone shift
Dialogue:
[S1] I am sorry. I should have called. [S2] You left the room and never came back. [S1] I froze. I did not know what to say. [S2] Say it now. Slowly.
Quick take: the model handles quieter lines without turning everything monotone.
Test 6: production notes debate
Dialogue:
[S1] Step one. Read the script. [S2] Step two. Record clean takes. [S1] Step three. Cut the breaths. [S2] No. Keep some breaths. [S1] Fine. But remove the clicks. [S2] Deal.
Quick take: this kind of back-and-forth fits podcast and tutorial content.
Speed snapshot (task elapsed time)
| Test | Elapsed (s) |
|---|---|
| 1 | 72 |
| 2 | 87 |
| 3 | 82 |
| 4 | 76 |
| 5 | 51 |
| 6 | 67 |
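For a quick summary of the table above, the elapsed times can be dropped into a small Python snippet. The numbers are copied from the table; the variable names are just for illustration.

```python
from statistics import mean

# Elapsed seconds per test, copied from the table above.
elapsed = {1: 72, 2: 87, 3: 82, 4: 76, 5: 51, 6: 67}

fastest = min(elapsed, key=elapsed.get)
slowest = max(elapsed, key=elapsed.get)

print(f"mean: {mean(elapsed.values()):.1f} s")            # mean: 72.5 s
print(f"fastest: test {fastest} ({elapsed[fastest]} s)")  # fastest: test 5 (51 s)
print(f"slowest: test {slowest} ({elapsed[slowest]} s)")  # slowest: test 2 (87 s)
```

That is a 36-second spread between the fastest and slowest run. Since the table reports task elapsed time, the figures may include more than synthesis alone.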
Takeaways
- Short turns help the model sound conversational.
- Speaker changes stay clear when the script uses clean tags.
- Code switching can work, but it needs listening checks for pronunciation.
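The first two takeaways can be turned into a rough pre-flight check on a script before it is sent to the model. This is a minimal sketch under stated assumptions: `check_script` is a hypothetical helper, the 12-word threshold is an arbitrary illustrative number rather than a model limit, and the same-speaker warning is only a hint, not a rule from the tests.

```python
import re

TAG = re.compile(r"\[(S\d+)\]")


def check_script(script, max_words=12):
    """Return a list of warnings for a tagged dialogue string.

    max_words is an arbitrary illustrative threshold, not a model limit.
    """
    warnings = []
    parts = TAG.split(script)
    if parts[0].strip():
        warnings.append("text before the first speaker tag")
    previous = None
    for speaker, text in zip(parts[1::2], parts[2::2]):
        words = text.split()
        if not words:
            warnings.append(f"[{speaker}] has an empty turn")
        elif len(words) > max_words:
            warnings.append(f"[{speaker}] turn has {len(words)} words")
        if "[" in text or "]" in text:
            # A leftover bracket usually means a tag was mistyped.
            warnings.append(f"[{speaker}] turn contains a stray bracket")
        if speaker == previous:
            warnings.append(f"[{speaker}] speaks twice in a row")
        previous = speaker
    return warnings


test_6 = (
    "[S1] Step one. Read the script. [S2] Step two. Record clean takes. "
    "[S1] Step three. Cut the breaths. [S2] No. Keep some breaths. "
    "[S1] Fine. But remove the clicks. [S2] Deal."
)
print(check_script(test_6))  # []
```

A check like this does not cover pronunciation in code-switched lines, which still needs listening.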