Live Avatar: Audio-Driven Talking Head Videos in 6 Tests
Live Avatar generates a talking head video from a still image and an audio clip. The input image sets identity and framing. The audio drives mouth movement. The tests below focus on lip sync, identity stability, and prompt steering.
Inputs used
Two short WAV clips and three images were used across all tests.
- Audio A (WAV): https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-audio-01.wav
- Audio B (WAV): https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-audio-02.wav
- Image 1 (dwarf blacksmith): https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-input-01.jpg
- Image 2 (fashion blogger): https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-input-02.jpg
- Image 3 (cat on surfboard): https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-input-03.jpg
How the model was run
- inputImageUrl: a face or character image
- inputAudioUrl: a WAV file URL
- prompt: style and scene guidance
- seed: used to vary motion and details
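The four inputs above can be sketched as a single payload. This is a minimal illustration of how the tests below vary the fields, not the real Wiro API: the helper name and payload shape are assumptions, and no endpoint or auth details are shown.

```python
# Hypothetical sketch of a Live Avatar run payload.
# Field names match the parameter list above; everything else is illustrative.

def build_payload(image_url: str, audio_url: str, prompt: str, seed: int = 0) -> dict:
    """Assemble the four inputs each test varies."""
    return {
        "inputImageUrl": image_url,   # face or character image
        "inputAudioUrl": audio_url,   # WAV file URL
        "prompt": prompt,             # style and scene guidance
        "seed": seed,                 # varied to change motion and details
    }

# Example: Test 1 pairs Image 1 with Audio A and a cinematic prompt.
payload = build_payload(
    "https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-input-01.jpg",
    "https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-audio-01.wav",
    "cinematic dwarf blacksmith at a forge, warm dramatic lighting",
    seed=42,
)
```

Rerunning with a different seed while holding the other three fields fixed is how motion and detail variations were explored.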
Test 1: Cinematic dwarf blacksmith (Audio A)

The face stays stable and the mouth motion follows the audio. The cinematic prompt pushes lighting and mood without breaking identity.
Test 2: Documentary style interview (Audio A)

This prompt aims for a flatter, interview-style look, which makes background flicker and identity drift easier to spot.
Test 3: Fashion blogger presenter (Audio B)

The model handles a real-photo-style input with cleaner skin texture. Small head motion looks natural when the prompt asks for subtlety.
Test 4: Cat on surfboard talking (Audio A)

This test pushes the model outside typical human faces. The key check is whether the subject stays recognizable while the mouth moves.
Test 5: Same identity, different audio (Audio B)

Swapping the audio tests whether mouth shapes adapt cleanly without changing identity.
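This test amounts to holding the image, prompt, and seed fixed and swapping only the audio field. A hedged sketch, using the field names from the parameter list (the payload shape is an assumption, not the real API):

```python
# Hold image, prompt, and seed constant; vary only inputAudioUrl (illustrative sketch).
base = {
    "inputImageUrl": "https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-input-02.jpg",
    "prompt": "subtle head motion, soft studio lighting",
    "seed": 7,
}
runs = [
    {**base, "inputAudioUrl": audio}
    for audio in (
        "https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-audio-01.wav",
        "https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-audio-02.wav",
    )
]
```

Because everything except the audio is identical, any difference between the two outputs isolates how mouth shapes adapt to the new voice.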
Test 6: Neutral presenter prompt (Audio A)

This test keeps the prompt minimal. It helps show baseline lip sync and stability without style pressure.
What worked well
- Lip sync stays coherent across different voices and pacing.
- Identity usually stays stable when prompts describe lighting and framing, not new facial features.
- Small head motion looks better than large camera moves for talking heads.
What to watch for
- Some outputs can shift the scene style more than expected when prompts push a strong aesthetic.
- Non-human subjects can look fun, but mouth motion may look less natural.