Live Avatar: Audio-Driven Talking Head Videos in 6 Tests
Live Avatar generates a talking head video from a still image and an audio clip. The input image sets identity and framing. The audio drives mouth movement. The tests below focus on lip sync, identity stability, and prompt steering.
Inputs used
Two short WAV clips and three images were used across all tests.
- Audio A (WAV): https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-audio-01.wav
- Audio B (WAV): https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-audio-02.wav
- Image 1 (dwarf blacksmith): https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-input-01.jpg
- Image 2 (fashion blogger): https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-input-02.jpg
- Image 3 (cat on surfboard): https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-input-03.jpg
How the model was run
- inputImageUrl: a face or character image
- inputAudioUrl: a WAV file URL
- prompt: style and scene guidance
- seed: used to vary motion and details
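The four inputs above can be sketched as a single payload. This is a minimal illustration of how the tests below vary the fields, not the real Wiro API: the helper name and payload shape are assumptions, and no endpoint or auth details are shown.

```python
# Hypothetical sketch of a Live Avatar run payload.
# Field names match the parameter list above; everything else is illustrative.

def build_payload(image_url: str, audio_url: str, prompt: str, seed: int = 0) -> dict:
    """Assemble the four inputs each test varies."""
    return {
        "inputImageUrl": image_url,   # face or character image
        "inputAudioUrl": audio_url,   # WAV file URL
        "prompt": prompt,             # style and scene guidance
        "seed": seed,                 # varied to change motion and details
    }

# Example: Test 1 pairs Image 1 with Audio A and a cinematic prompt.
payload = build_payload(
    "https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-input-01.jpg",
    "https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-audio-01.wav",
    "cinematic dwarf blacksmith at a forge, warm dramatic lighting",
    seed=42,
)
```

Rerunning with a different seed while holding the other three fields fixed is how motion and detail variations were explored.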
Test 1: Cinematic dwarf blacksmith (Audio A)

The face stays stable and the mouth motion follows the audio. The cinematic prompt pushes lighting and mood without breaking identity.
Test 2: Documentary style interview (Audio A)

This prompt aims for a flatter, interview-style look, which makes background flicker and identity drift easier to spot.
Test 3: Fashion blogger presenter (Audio B)

The model handles a real-photo-style input with cleaner skin texture. Small head motion looks natural when the prompt asks for subtlety.
Test 4: Cat on surfboard talking (Audio A)

This test pushes the model outside typical human faces. The key check is whether the subject stays recognizable while the mouth moves.
Test 5: Same identity, different audio (Audio B)

Swapping the audio tests whether mouth shapes adapt cleanly without changing identity.
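This test amounts to holding the image, prompt, and seed fixed and swapping only the audio field. A hedged sketch, using the field names from the parameter list (the payload shape is an assumption, not the real API):

```python
# Hold image, prompt, and seed constant; vary only inputAudioUrl (illustrative sketch).
base = {
    "inputImageUrl": "https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-input-02.jpg",
    "prompt": "subtle head motion, soft studio lighting",
    "seed": 7,
}
runs = [
    {**base, "inputAudioUrl": audio}
    for audio in (
        "https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-audio-01.wav",
        "https://wiro.ai/blog/wp-content/uploads/2026/03/live-avatar-audio-02.wav",
    )
]
```

Because everything except the audio is identical, any difference between the two outputs isolates how mouth shapes adapt to the new voice.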
Test 6: Neutral presenter prompt (Audio A)

This test keeps the prompt minimal. It helps show baseline lip sync and stability without style pressure.
What worked well
- Lip sync stays coherent across different voices and pacing.
- Identity usually stays stable when prompts describe lighting and framing, not new facial features.
- Small head motion looks better than large camera moves for talking heads.
What to watch for
- Some outputs can shift the scene style more than expected when prompts push a strong aesthetic.
- Non-human subjects can look fun, but mouth motion may look less natural.